
The Measurement of Groups
and Series:


A COURSE OF LECTURES

BY

A. L. BOWLEY, M.A., F.S.S.,

(APPOINTED TEACHER OF STATISTICS AT THE SCHOOL OF ECONOMICS,
UNIVERSITY OF LONDON).


DELIVERED AT THE 

Institute of Actuaries, Staple Inn Hall, 

During the Session 1902-1903. 


LONDON: 

CHARLES AND EDWIN LAYTON 

55, FARRINGDON STREET, E.C.
1903. 






PREFACE. 


The present course of Lectures on the Measurement of
Groups and Series deals with some of the most modern
methods of statistical research. Interesting as they were to
those who had the advantage of hearing them delivered,
they will doubtless, when studied at leisure in printed form,
prove even more interesting and useful.

These Lectures are the fifth of a Series originated in
1897, designed for the assistance of Actuarial Students in
connection with matters not included in the official Text
Books. Three of the Series deal with legal matters, and
one with the subject of Stock Exchange Securities. The
present course carries the range of topics into the field of
mathematics, and it is hoped that courses of lectures may
be hereafter provided dealing with other subjects, practical
and theoretical, relating to those branches of knowledge
which it is the province of the Institute of Actuaries to
promote and encourage.


W. H. 




24th February, 1903.




TABLE OF CONTENTS.


MEASUREMENT OF GROUPS
    Graphic Method
    Histogram and Ogive
    Averages
    Standard Deviation and Modulus
    Average Deviation
    Probable Error
    Measurement of Skewness
    The Curve of Error
    The Method of Least Squares
    Fitting Formulae to Observations
    Uses of the Curve of Error
    [...]ruction of a Group from Samples
    Correlation between Two Groups
    The Coefficient of Correlation
    Justification of the Formula

MEASUREMENT OF SERIES
    Classification
    Periodic Curves
    Symptomatic Series
    Correlation between Series
    [...]on of Significance of the Correlation Coefficient
    Conclusion







NOTE. In these lectures I have made free use of those
statistical methods and formulae, which (though in many cases of recent
origin) may now perhaps be regarded as common property; but I hope
that I have not inadvertently quoted without reference, or misquoted,
investigations or theories, which may be regarded as personal to any of the
small body of statisticians working on the subject treated. The lectures
had to be prepared both for delivery and for the Press at short notice in the
midst of a busy session. This fact must be my apology for any obscureness,
unnecessary repetition, or clumsiness of arrangement or expression, which
may be found.


A. L. BOWLEY. 



MEASUREMENT OF GROUPS.


FIRST LECTURE. 


Gentlemen, it was with considerable diffidence that I [...]
to lecture to members of the Society of Actuaries [...]
subject with which they may be presumed to be so [...].
When I was asked if I could undertake these [...]
had some difficulty in choosing a suitable subject, [...]
occurred to me that my audience were probably [...]
with the practical aspects of a question which I [...]
considering from the theoretical point of view, and [...]
would therefore be most suitable if I endeavoured to [...]
them some theoretical considerations on subjects [...]
not come in their ordinary course, but which were [...]
subjects which naturally come before them, and
allied to those subjects on which I have spent a [...]
amount of time and attention.

Groups.

The subject which I have selected is the measurement
of the characteristics of a group, and its representation.
By a group I understand a number of persons or things
each of which possesses a measurable characteristic, the
[...] being arranged according to the magnitude of the
characteristic. For example, if I have returns of the wages
of a number of people, and I group them according to
[...], saying how many are earning 20s. to 25s., and
so on, I shall have such a group; or if I choose a section of
the population and group them according to ages, I should
have another group of the kind I am thinking of. The
[...] I now make about groups will, I hope,
[...] apply to a very large range of groups,
[...] convenience of illustration I shall confine myself
[...] a small number.

The particular group I am taking for discussion
[...] taken from the current Census, the number of
married women in the County of York, on Census day,
arranged according to their ages. In selecting a group
[...] it must be large enough and small enough;
it must be sufficiently large to conceal individual peculiarities
or peculiarities of small sections; it must be sufficiently
small to be homogeneous. Both these limits are relative. A group
that is large enough for one purpose is too large for another
purpose; and a group that is homogeneous for one is heterogeneous
for another. The death rate of a whole country may
be sufficient for certain comparisons, but for other comparisons
you must subdivide according to districts and age. The [...]
which has been selected must be kept in view before any
arguments are based on the grouping and its measurement.

There are two main divisions of groups: those that are
derived from exact observations, and those which may be
regarded as samples of a larger group the whole of which has
not been measured. For purposes of reference I am calling
them Group α, when the observations are supposed to be
correct, as, for example, the number of persons who are in
receipt of a certain income; and Group β, where the numbers
are estimates; for example, an estimate of the number of
persons who may be expected to be in receipt of certain
incomes ten years hence, from an investigation of some [...]
now, or at some previous time. As regards Group α our
chief work will be to select some method of abbreviating,
of describing in brief, each group; in the case of Group β
our work will be chiefly to criticise the correctness of the
statements, and to find methods which are [...]
applicable for its correction if it is not exact, to measure its
precision, and then afterwards to select some suitable method
of abbreviating it.


The Graphic Method.

The two chief methods of abbreviating or investigating
the characteristics of a group are the graphic method and
the method of averages. The method of averages should
perhaps be referred to first; but, since the use of diagrams in
explaining the meaning of averages is very considerable, I
have thought it better to take the method of diagrams first.
I have drawn out, in four different ways, the group already
named, the number of married women in the county of York.

Ages of Wives present with their Husbands in the Registration
County of York, 1901.

   Between            No. per 1,000      Not more than      Per 1,000

15 and 16 years            0.01          16 years old           0.01
16 and 17 years            0.03          17 years old           0.04
17 and 18 years            0.22          18 years old           0.26
18 and 19 years            1.2           19 years old           1.5
19 and 20 years            3.4           20 years old           5
20 and 21 years            8             21 years old          13
21 and 25 years           75             25 years old          88
25 and 30 years          157             30 years old         245
30 and 35 years          162             35 years old         407
35 and 40 years          147             40 years old         554
40 and 45 years          125             45 years old         679
45 and 50 years          105             50 years old         784
50 and 55 years           80             55 years old         864
55 and 60 years           55             60 years old         919
60 and 65 years           40             65 years old         959
65 and 70 years           22             70 years old         981
70 and 75 years           14             75 years old         995
75 and 80 years            4             80 years old         999
Above 80 years             1

Total                  1,000

The two classes from 20 to 25 years together contain 83 per 1,000.

Total number included, 610,505.


There are shown in Diagrams I to IV the numbers of married
women in that county per thousand between these ages. The
total of wives in the county of York living with their husbands
was 610,000 odd. As is usual, the numbers are divided in years
between the ages of 15 to 21, and after that in five-yearly
groups. The first method of representing figures by diagram
is to place a dot in a given vertical position for each person
or item in question. This is indicated in Diagram I. The
method is not very important and is perfectly obvious. I
should only use it as a means of passing to another, if it were
not that in those classes of measurement where the quantities
are separated by a finite interval it is incorrect to use the methods
shown in Diagrams II, III and IV. If one was entering
the number of houses at particular rents in a town, where it
might perhaps be supposed that the rent always jumped by as
much as £2, one could represent properly the number of
houses at each £2 mark, but there would be no house at the
intermediate intervals, and it would be incorrect to proceed
any further to such a curve as would lead one to suppose that
the quantity dealt with was continuous. To take another
example, the railway service from one town to another might
be represented by a series of dots placed vertically over the
time taken by the train, measured horizontally, but not by the
following methods. If, however, the quantity is capable of
continuous variation, such as age or height, or if by a slight
extension of the meaning it may be regarded as being capable
of continuous variation, such as income, we may proceed to
the method of Diagram II.

[Diagram I. Point Diagram. Horizontal axis: age in years.]

In Diagram II rectangles are drawn whose height is
the same as the height of corresponding lines of dots in
Diagram I, but the breadth is the unit of abscissae, in this
case five years. The areas can be regarded as representing a
number of persons. The area of the whole space enclosed
by the outer lines of the rectangles is, on the scale chosen,
[...] square inches, which represents the whole of the population
considered; the breadth of each rectangle* is [...] inch, and
[...] squared represents 1 per cent.

* The areas for ages below 25 are shown in more detail.

[Diagram II. Rectangles on five-year bases. Horizontal axis: age in years, 15 to 85.]

Before we can go any further we have to make some
assumption as to the distribution of persons within the five-yearly
intervals selected. Even in my class (α), when the
observations are known to be correct, some assumption must
be made as to distribution before proceeding further. If, for
example, the correct set of measurements of the heights of a
regiment was given, every soldier being measured correctly
to the nearest [...] of an inch, no correction would be required
for actual mistakes, but before a continuous curve could be
drawn passing from one [...] inch to the next, some assumption
must be made as to distribution of heights, e.g., that progression
was uniform between the given points. In the case
of observations (β) which are only samples, it is still
necessary to make some assumption as to continuity [...].
[...] what assumption is proper, in the case [...]
remember that the facts given exactly are [...]
persons whose ages lie between certain limits;
that is, we are given the area of the rectangle, or of the
figure which replaces the rectangle on each unit of [...].
What we have to suppose is that the ages are subdivided not
merely into years, but into infinitesimal units of time; and
we have to make some assumption for guiding us in passing
from one of the given positions to the next. There are a [...]
positions which give definitely the number of persons below
20 years, below 25 years, and so forth. We have to find the
number of persons below 23, or any other assigned age. This
is a quite familiar idea; but there are one or two things in
connection with it which it is necessary to point out.

The Histogram and the Ogive.

If straight lines are drawn from the middle of each
horizontal line in Diagram II to the middle of the next, we
get the dotted line in Diagram III (called a histogram).

[Diagram III. Histogram.]

That is certain to be incorrect on two grounds. In the first
place, the area bounded by the lines nearest the highest point
is necessarily too small, for part of the area between the ages
30 and 35 is cut off by the dotted line, and nothing is placed
instead of it. Before pointing out the other way in which
the histogram is necessarily incorrect, we will pass on to
Diagram IV, which is thus constructed. At the ages shown
on the horizontal axis are drawn rectangles proportional to the
number of persons below that age; then we get a continually
ascending figure called an ogive, which is given as absolutely
correct in group α, the points at the corners of the steps
obtained in the figure being given by assumption. The
problem comes to be to draw some line or curve from these
fixed points that shall satisfy the conditions which we must
assign. Now, it would be necessarily wrong to join these
successive points by straight lines. If we take three corners,
A, B, C, not in a straight line, we get a sharp angle at B.
Introducing sharp angles there necessarily involves an error,
for they indicate discontinuity at certain arbitrary points,
which can correspond to no facts in nature. The angles
obtained in the histogram are erroneous for similar reasons.
If, as in group α, we are to suppose the observations to be
correct, a continuous line must be drawn through all the given
points which has no sharp angles in it, no sharp change
of curvature. If, as in group β, we are not bound to assume
that the observations are correct, the line may be drawn not
passing through the points, but near them. Many groups
may be represented with sufficient accuracy in rough work by
drawing a freehand curve passing through the given points;
it will be found that there is very little margin for drawing
such a curve, if the rule is made that the curvature is never
to be greater than necessary, that the direction is not changed
more rapidly than is necessary to pass through the points.
This condition, stated in mathematical language, supplies the
main problem of interpolation.

In group β we are not bound to assume that the curve
passes through all the points, and the question which is the
best curve (drawn freehand or otherwise) near the points,
needs the theory of probability for its discussion.

Interpolation Curve.

As regards group α, to which discussion may be confined
for the present, where the curve is to pass through all the
points, I suggest the familiar method of interpolation by a
parabolic formula. Take the equation
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots$
continued to as many terms as are convenient. In the group
under discussion it would be inexpedient to take more than
4 or 5 terms, because we are fitting a definite algebraic curve
to irregular observations, and the law which underlies the
observations may very well change if we take a larger
period than 25 years. I have confined the work for this group
to 5 terms, continuing that series to $x^4$.

Consider what conditions that curve satisfies. Stopping at
the second term we have a straight line, which can only be
made to pass through two points. So we have to start afresh
at the second point, and thereby contradict the first assumption
which we make, that the increase is not subject to violent
changes, introducing an angle at any point. Introducing the
term in $x^2$ we have a parabola, which can be made to pass
through three points; the curve has continuous curvature, the
third differential, and the third differences obtained from the
values of $y$ at four consecutive equidistant values of $x$,
vanish; that is to say, there is no sudden change in curvature.
The first differential measures the inclination of the line; the
second measures the change of inclination, and if that is
constant there is a constant change of inclination, but no
sudden break. But we have no reason for assuming a constant
change of inclination, and the curve which passes through
three assigned points will not in general pass through the next
point. We then proceed to include further terms. If we take
the equation up to $x^3$ we can introduce a point of inflection,
which we cannot do with the parabola. If we take the
equation a step further we introduce two points of inflection,
and it is unnecessary to go as far as the fifth term in a
diagram like this. If we take 6 terms, the 6th differential and
the 6th difference vanish, the 5th difference is constant, and
there is no sudden break, and so on.
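
The statement that the higher differences of a parabolic formula vanish can be checked numerically. The following is only a minimal sketch: the helper name `differences` and the sample parabola are illustrative assumptions, not figures from the lectures.

```python
def differences(values, order):
    """Return the finite differences of the given order of a list of values."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values

# A hypothetical parabola y = 2 + 3x + 5x^2, tabulated at x = 0, 1, ..., 6.
parabola = [2 + 3 * x + 5 * x**2 for x in range(7)]

second = differences(parabola, 2)  # constant for a second-degree curve
third = differences(parabola, 3)   # vanishes for a second-degree curve
print(second, third)
```

Each third difference uses four consecutive tabulated values, which is why a curve through only three points leaves the fourth point undetermined.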

Now take the equation as far as the term $x^4$. We have
5 unknowns, and can determine them by assigning the
condition that the curve shall pass exactly through 5 points
on the diagram in question. I have calculated the equation
of the curve which will pass exactly through the 5 points of
the ogive, corresponding to 20, 25, 30, 35 and 40 years. The
method of calculation is a good deal facilitated by the use of
finite differences. Refer as origin for the abscissae to age 20,
take 5 years as unit, so that $x$ is 1 at age 25, and write down
the equations which naturally arise, taking the numbers from
the last column of the table on p. 3.


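$$
\begin{aligned}
5 &= a_0 \\
88 &= a_0 + a_1 + a_2 + a_3 + a_4 \\
245 &= a_0 + a_1 \cdot 2 + a_2 \cdot 2^2 + a_3 \cdot 2^3 + a_4 \cdot 2^4 \\
407 &= a_0 + a_1 \cdot 3 + a_2 \cdot 3^2 + a_3 \cdot 3^3 + a_4 \cdot 3^4 \\
554 &= a_0 + a_1 \cdot 4 + a_2 \cdot 4^2 + a_3 \cdot 4^3 + a_4 \cdot 4^4
\end{aligned}
$$

It is easily shown that $a_4 = \Delta^4/24$, $a_3 = \Delta^3/6 - \Delta^4/4$,
$a_2 = \Delta^2/2 - \Delta^3/2 + \tfrac{11}{24}\Delta^4$, and $a_1$ and $a_0$ are then easily
calculated. In this case $\Delta^4 = 49$, $\Delta^3 = -69$, $\Delta^2 = 74$, $a_4 = 2\tfrac{1}{24}$,
$a_3 = -23\tfrac{3}{4}$, $a_2 = 93\tfrac{23}{24}$, $a_1 = 10\tfrac{3}{4}$.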
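
The coefficients can be reproduced from the leading differences of the five tabulated ordinates. This is only a sketch: the helper names are illustrative, and the expression used for `a1` is not stated in the text but is supplied here from Newton's forward-difference formula as an assumption.

```python
from fractions import Fraction

# Ogive ordinates (per 1,000) at ages 20, 25, 30, 35, 40, from the table.
y = [5, 88, 245, 407, 554]

def leading_differences(values):
    """Return [y0, D1, D2, ...]: the first entry of each difference row."""
    lead = []
    row = list(values)
    while row:
        lead.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return lead

y0, d1, d2, d3, d4 = (Fraction(v) for v in leading_differences(y))

a4 = d4 / 24
a3 = d3 / 6 - d4 / 4
a2 = d2 / 2 - d3 / 2 + Fraction(11, 24) * d4
a1 = d1 - d2 / 2 + d3 / 3 - d4 / 4  # assumed form, from Newton's formula
a0 = y0
coeffs = [a0, a1, a2, a3, a4]

def ogive(x):
    """Value of the fitted curve at abscissa x (origin age 20, unit 5 years)."""
    return sum(c * x**k for k, c in enumerate(coeffs))

print([str(c) for c in coeffs])
print([ogive(x) for x in range(5)])  # should return the five tabulated ordinates
```

Exact fractions are used so that the check that the curve passes through the five points is free of rounding error.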

Let $z = f(x)$ be the equation of the curve replacing the
histogram, and $y = F(x)$ the equation of the ogive. Then
$y = \int z\,dx$, and $z = \dfrac{dy}{dx}$. By means of these equations, the
parabolic or smoothed curve now obtained from Diagram IV
can be used to furnish values to replace the histogram of
Diagram III. The same unit for $x$ (five years) must, of
course, be used in both cases. Thus, if $x = 1$ (25 years),

$$z = \frac{dy}{dx} = a_1 + 2a_2 \cdot 1 + 3a_3 \cdot 1^2 + 4a_4 \cdot 1^3 = 135.6.$$

At 30 years, $z = 166.9$; at 35 years, $z = 153.8$.
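
The ordinates of the smoothed curve can be checked by differentiating the fitted polynomial term by term. A minimal sketch, taking the coefficient values that follow from the five equations above (origin at age 20, unit five years); the function name `density` is an illustrative assumption.

```python
# Coefficients a0..a4 of the ogive y = a0 + a1*x + a2*x^2 + a3*x^3 + a4*x^4.
a = [5, 10.75, 93 + 23 / 24, -23.75, 49 / 24]

def density(x):
    """z = dy/dx: number per 1,000 per five-year unit at abscissa x."""
    return sum(k * a[k] * x ** (k - 1) for k in range(1, len(a)))

for x in (1, 2, 3):
    # Print the age in years and the ordinate of the smoothed curve there.
    print(20 + 5 * x, round(density(x), 2))
```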

The curve obtained in this way is shown in the continuous
line in Diagram III; this curve satisfies the conditions that
the areas standing on the 5-year bases, from 20 years to
45 years, should represent on the chosen scale the number of
persons given by the original table, and that there should be
no abrupt changes of curvature.

Since the curve has been chosen so as to satisfy the
conditions for only five age periods, it will not necessarily
satisfy any more; but in this case the curve merges into
a straight line, which approximately fulfils the conditions till
65 years. If we need greater accuracy in later years, we
should calculate new values for the a's and obtain a second
curve, satisfying a new group of area conditions. If we
needed to draw the whole curve accurately, we should have
to devise a method of passing without a break of continuity
from one such parabolic curve to the next; but, as it is, we
only want means of obtaining specified points on the curve,
and that can be done by choosing the special parabolic curve
that is in the neighbourhood of the required points.




Averages. 

The Mode. At the highest point of the smooth curve in
Diagram III, $\dfrac{dz}{dx} = 0$; hence $\dfrac{d^2y}{dx^2} = 0$ in the ogive for the same
value of $x$. Thus, as is otherwise evident, the ogive is
steepest and there is a point of inflexion, at that value of $x$
which gives the greatest ordinate in Diagram III.

Thus $0 = 2a_2 + 6a_3 x + 12a_4 x^2$, and

$$x = \frac{-3a_3 \pm \sqrt{9a_3^2 - 24a_2 a_4}}{12a_4},$$

where $a_2$, $a_3$, $a_4$ are given in terms of the differences above.
Writing in these values, and taking the lower sign, we have
$x = 2.021$, which corresponds to 30.10 years.
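
The position of the mode can be verified by solving the quadratic directly. A minimal sketch, using the coefficient values found for the curve starting at age 20; the variable names are illustrative assumptions.

```python
import math

a2, a3, a4 = 93 + 23 / 24, -23.75, 49 / 24

# Roots of 0 = 2*a2 + 6*a3*x + 12*a4*x^2, the points of inflexion of the ogive.
disc = 9 * a3**2 - 24 * a2 * a4
roots = [(-3 * a3 + sign * math.sqrt(disc)) / (12 * a4) for sign in (-1, 1)]

# The smaller root is the maximum of z: the slope of dz/dx is negative there.
x_mode = min(roots)
slope_of_dz = 6 * a3 + 24 * a4 * x_mode
age_mode = 20 + 5 * x_mode
print(round(x_mode, 3), round(age_mode, 2), slope_of_dz < 0)
```

The larger root marks the second point of inflexion of the fitted ogive, which is a minimum of the smoothed curve rather than the mode.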

If we had included a further term, we should have
a cubic to solve to determine $x$. If we had only gone as
far as $x^3$ we should have the equation $0 = a_2 + 3a_3 x$; that
is, $x = 1 - \Delta^2/\Delta^3$; but this formula is unsafe, unless the
fourth differences of the original figures are approximately
zero.

The equation taken as far as the term $x^4$ appears to me
to be practically the best in the example we are discussing.
If we choose the coefficients to satisfy the conditions starting
from the age 25, we obtain 30.43 years as the position of
the highest point. The discrepancy between this and the
30.10 years found from the parabola starting from the age
20 years, arises from the indeterminateness of the original
figures. It seems best to take the value from the former
curve, as the point then lies near the middle of the assigned
values. We adopt then the age 30.10 years as the required
age, and find that $z = 166.94$ at that age. The age so found
is called the mode of the group; it is also called the position
of greatest density and of the maximum ordinate.

The Median. The abscissa of the point (M) where
the ogive is cut by the horizontal line half way up the
scale (from 0 to 1,000) is called the median. In the
histogram, or the smoothed curve which replaces it, the
vertical through the median divides the curve into equal
areas. When the ogive is drawn, the median can at once
be found graphically. To find it algebraically, take a
parabolic equation as before, satisfied by five points lying
near the median, obtain the coefficients as before, and put
$y = 500$. One of the roots of the equation so obtained is the
median. Starting at 30 years we have, with the coefficients
recalculated for the new origin,

$$245 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 = 500,$$

and, solving by Horner's method, $x = 1.623$, so that the
median age is $(30 + 5x) = 38.11$ years.
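
The algebraic determination of the median can be sketched as follows. Two things here are assumptions rather than the method of the text: the polynomial is evaluated in Newton's forward-difference form instead of with explicit coefficients, and the root is found by bisection rather than by Horner's method. Working from the rounded per-thousand figures of the table, the sketch gives a median close to 38.1 years; a small difference from the figure quoted above is to be expected, possibly because the lecture used less heavily rounded data.

```python
# Ogive ordinates (per 1,000) at ages 30, 35, 40, 45, 50, from the table.
y = [245, 407, 554, 679, 784]

def newton_value(values, x):
    """Evaluate the forward-difference polynomial through `values` at x."""
    total, term, row, k = 0.0, 1.0, list(values), 0
    while row:
        total += term * row[0]
        term *= (x - k) / (k + 1)
        k += 1
        row = [b - a for a, b in zip(row, row[1:])]
    return total

def solve(target, lo, hi, tol=1e-9):
    """Bisection; assumes the ogive rises steadily between lo and hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if newton_value(y, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 500 lies between the ordinates at x = 1 (407) and x = 2 (554).
x_median = solve(500, 1, 2)
median_age = 30 + 5 * x_median
print(round(x_median, 3), round(median_age, 2))
```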

The quartiles are the abscissae of the points (Q1, Q2)
where the horizontal lines one-quarter and three-quarters up
the scale from 0 to 1,000 cut the ogive. The verticals
through these abscissae in Diagram III would, together
with the median vertical, divide the area into four equal
parts. The quartiles can be found from one of the equations
already written by putting $y = 250$ and $750$ successively,
and solving for $x$; or they can be found graphically.

A rough method of finding these points, often sufficiently
accurate, and saving a more laborious solution, is to assume
that the parts of the ogive between the corners which
contain the median are straight lines. There are 407 (per
thousand) below 35 years; 93 out of the 35 to 40 year
group, which contains 147, are to be taken to reach the
median, which is on the hypothesis of a straight line
$(35 + \tfrac{93}{147} \times 5)$ years, that is 38.17 years, a value differing
little from that already obtained. The lower quartile by
either method is 30.16 years, the upper 48.6 years.
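
The straight-line rule amounts to linear interpolation between the corners of the ogive. A minimal sketch, assuming the cumulative column of the table as given; the function name `age_at` is illustrative. With the rounded per-thousand figures it gives the median and lower quartile to within about a hundredth of a year of the values quoted.

```python
# Corners of the ogive: (age, number per 1,000 not older than that age).
ogive = [(20, 5), (25, 88), (30, 245), (35, 407), (40, 554), (45, 679),
         (50, 784), (55, 864), (60, 919), (65, 959), (70, 981), (75, 995),
         (80, 999)]

def age_at(level):
    """Age at which the ogive reaches `level`, joining corners by straight lines."""
    for (x0, y0), (x1, y1) in zip(ogive, ogive[1:]):
        if y0 <= level <= y1:
            return x0 + (x1 - x0) * (level - y0) / (y1 - y0)
    raise ValueError("level outside the tabulated range")

median = age_at(500)    # 93 of the 147 in the 35 to 40 group are needed
lower_q = age_at(250)   # 5 of the 162 in the 30 to 35 group are needed
print(round(median, 2), round(lower_q, 2))
```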

We have now the following figures:

    Lower quartile       ...   30.16 years.
    Mode                 ...   30.10   "
    Median               ...   38.11   "
    Arithmetic average   ...   40.111  "
    Upper quartile       ...   48.6    "

The arithmetic average is calculated directly in the 
ordinary way, but is of little importance in such a group 
as this. 
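
For a grouped table the "ordinary way" is usually taken to mean weighting the middle of each class by the number in it. The sketch below makes that assumption, and also assumes that the open class above 80 closes at 85; neither choice is stated in the lecture. With the rounded per-thousand figures it gives an average close to 40.1 years.

```python
# (lower age, upper age, number per 1,000) for each class of the table.
# The upper limit 85 for the open class "above 80" is an assumption.
classes = [(15, 16, 0.01), (16, 17, 0.03), (17, 18, 0.22), (18, 19, 1.2),
           (19, 20, 3.4), (20, 21, 8), (21, 25, 75), (25, 30, 157),
           (30, 35, 162), (35, 40, 147), (40, 45, 125), (45, 50, 105),
           (50, 55, 80), (55, 60, 55), (60, 65, 40), (65, 70, 22),
           (70, 75, 14), (75, 80, 4), (80, 85, 1)]

total = sum(n for _, _, n in classes)
# Each class is treated as concentrated at its middle age.
average = sum((lo + hi) / 2 * n for lo, hi, n in classes) / total
print(round(total, 2), round(average, 2))
```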

If a person is taken at random from this group, her
most probable age is 30.1 years, the mode. It is as likely as
not that she will be over 38.11 years, the median. It is as
likely as not that she will be between 30.16 and 48.6 years.
The chances are 3 to 1 against her being less than 30.16
years; 3 to 1 against her being over 48.6 years.

Other points can be obtained by dividing the group into
ten equal parts, or one hundred equal parts. These are called
the deciles and percentiles respectively.



The mode is [...] of
special importance, as being the most probable value. It
is entirely unaffected by the extremes. If the Census
authorities had omitted all married women over 50 or all
under 20 years in their enumeration the mode would be still
in the same place. That is very important when we are
dealing with inaccurate figures. In those curves which have
a distinct mode, where the curve first tends upwards, reaches
a height, and then comes down again without ever pausing or
returning to a second height, and where there is a certain
symmetry or similarity of distribution on either side of it, in
such curves the mode is of special importance. If, on the
other hand, you have a regular mountain range represented
by your curve, the mode is probably of much less importance.
If you have a single peak it is probably of importance. But
though it is important in itself it is quite insufficient to
describe the curve; it only tells you the position of one point;
it does not tell you the steepness on either side, or the distance
from there to any assigned point.

The median is affected by extremes to some extent. If
the authorities had omitted all the married women over
50 the median would of course have been shifted, but not
very much, for the area which would have been left out
at the extreme right, when halved and distributed in the
neighbourhood of the median, would be found to have
caused only a very slight displacement of it. That can
be verified from Diagram IV. To take an example which
can be supplied by the diagram, suppose you omit all
those beyond the 800 mark (per thousand), which gives those
above 55, then the line through the 400 would give the median,
which a very rough measurement gives as 33 years. That is to say,
the median has only been shifted five years by leaving out
that immense number. If, instead of omitting these people
over 55, the Census authorities had simply said, "Here is a
married woman, obviously old, we do not know her age," and
had entered her in that category, it would not have affected
the median in the very least. The position of the extremes
does not affect the median, only the number of instances. In
the statistics with which I personally have to deal, often all
that is known is this number. In this respect the median
is very superior to the arithmetical average. The same applies
to quartiles. If we do not know the exact [...] of the
[...]. If you decide the two
quartiles and the median you have three points on the ogive,
[...] in the histogram, from which the whole
curve can be constructed with fair accuracy.

The arithmetic
average ("the average") gives the abscissa of the
centre of gravity of the group when plotted out as in
Diagram III. The arithmetic average facilitates certain
computations, but, in my experience, it is the least valuable
of the means or averages which can be calculated; other
people's experience may be different. It is very liable to
error. If a part of the group is accidentally omitted the
average is at once affected. If the numbers are correct and
the positions not very far out, you would find by experiment
that the arithmetic average has not moved much; but directly
any numbers are left out, the arithmetic average is disturbed.
But the reason I distrust the arithmetic average and do not
advocate its use is chiefly because it renders such fallacious
arguments possible. If you are comparing one group with
another, after a little interval the arithmetic average may
have remained quite steady when the group has changed
considerably, both the extremes having come in towards the
mean; or it may shift when the group has not really changed
its character, but only shifted its position a little. Any
particular change of the arithmetic average may correspond
to an infinite number of different kinds of change in the
group; and it is very often pointed out that a certain group
has changed, that something has improved because the
arithmetic average has changed; whereas it is only shifting
the relative positions of two groups which are not connected
in reality.* If we have a perfectly homogeneous group, for
instance, if with wage statistics, we deal with a set of men
doing similar work and earning similar wages, a change in the
arithmetic average is significant; but if we are dealing with a
composite group composed of skilled and unskilled workmen,
two homogeneous groups merged into one, the arithmetic
average might increase either by the higher group ascending
a little while the lower group went down nearly as far, or the
other way about; or by a combination of those two things.
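
The point about composite groups can be illustrated with a small numerical sketch. Every figure below is hypothetical, chosen only to show the effect; none is taken from the lectures.

```python
def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)

# Hypothetical weekly wages in shillings; not figures from the lectures.
skilled_before, unskilled_before = [36, 38, 40], [18, 20, 22]
skilled_after, unskilled_after = [41, 43, 45], [14, 16, 18]

before = mean(skilled_before + unskilled_before)
after = mean(skilled_after + unskilled_after)

# The combined average rises although the lower group is worse off.
print(before, after)
print(mean(unskilled_before), mean(unskilled_after))
```

In this made-up case the higher group gains five shillings, the lower loses four, and the average of the merged group rises slightly, so the rise by itself says nothing definite about either group.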

* These two sentences apply also to the median, but the present unfamiliarity
of the term will suggest caution in using it; while, as a matter of fact, the
arithmetic average is used very carelessly.



So the arithmetic average can never give definite,
and very often gives fallacious, information. I have not time,
and perhaps it is not necessary, to dwell upon this and
refer to the correction factors for Urban death rates. The
necessity of that method illustrates my meaning in saying that
before an arithmetic average is used, it is necessary to make
sure that the group is homogeneous.

The quartiles and the median not only give the definite
position of the median, but also a measurement, which serves
to show how the curve is dispersed from its central position.
The distance between the two quartiles, 18.4 years in this case,
shows to some extent how the curve is dispersed from its
central point. That I shall return to in giving other
measurements of this dispersion.

If we were dealing with a group that did not give any
such regular figure as this, a group to which the mode was
certainly quite inapplicable, it would probably then be best not
to attempt to draw any continuous curve at all, but to keep to
such a diagram as that on page 5, and to calculate the deciles
as accurately as possible. By making some simple assumptions
as to continuity, it would be possible to calculate roughly the
nine deciles, dividing the area into 10 equal parts, and enter
them as a description of the group. I think that is the only
method of satisfactorily representing an irregular group which
cannot be divided into distinct homogeneous groups.
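
One simple assumption as to continuity is that persons are spread evenly within each class, so that the deciles follow by straight-line interpolation on the cumulative table. The sketch below applies that assumption to the table of wives' ages purely as an illustration; closing the open class at 85 years is a further assumption, and the names are illustrative.

```python
# Corners of the ogive: (age, number per 1,000 not older than that age).
# The closing point (85, 1000) for the open class is an assumption.
ogive = [(15, 0), (20, 5), (25, 88), (30, 245), (35, 407), (40, 554),
         (45, 679), (50, 784), (55, 864), (60, 919), (65, 959), (70, 981),
         (75, 995), (80, 999), (85, 1000)]

def age_at(level):
    """Age at which the cumulative number per 1,000 reaches `level`."""
    for (x0, y0), (x1, y1) in zip(ogive, ogive[1:]):
        if y0 <= level <= y1:
            return x0 + (x1 - x0) * (level - y0) / (y1 - y0)
    raise ValueError("level outside the tabulated range")

# The nine deciles divide the group into ten equal parts.
deciles = [age_at(100 * k) for k in range(1, 10)]
print([round(d, 1) for d in deciles])
```

The fifth decile is the median, and the nine values together describe the group without any curve being fitted.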

Comparison of Groups.

The ogive diagram lends itself more readily than any other
to the comparison of two groups. I have selected two
groups, which one might wish to compare, from the same
Census table, the husbands whose wives were between
45 and 50 years of age, and the wives whose husbands were
between 45 and 50, which are represented by the lines
LL and KK respectively; and I have calculated, by one
method or the other, the mode, the median and the quartiles of
those groups. Thus, for instance, from the curve K, of all the
wives whose husbands were between 45 and 50 years of age,
as many were less than 45.5 years as were more than that; and
similarly for the quartiles. The curves are very similar, the
husband curve being four years to the right of the other.
The method needs no further comment.



[Diagram V.]





Illustration of Use of the Median.

I may take one example to illustrate the use of the
median. The diagram on p. 18 represents the weekly wages,
valuing everything that is paid in goods and not in money
at an appropriate rate, of three classes of labourers in
England, namely, Artisans in Provincial Towns, such as
Birmingham, Agricultural Labourers (the average for the
whole of England) and Labourers in the same towns from
which the Artisans were selected. The figures are rather rough,
and there is no material for making them exact; but I think
the lines drawn represent with fair accuracy the course of
wages; for if we once established the fact that all agricultural
labourers are below the median, we have simply to count them
and not enquire about their wages. And so if we establish
the fact that any body of men is well above or well below the
median, we have not to enquire into their wages, but simply
to count them; and to find the median we have only to
investigate more carefully the body of men whose wages are
near the median; that is a comparatively easy task, because
the body of men who are near to it are those whom we see
any day in any ordinary industrial undertaking. The
Census figures are bad for this purpose in 1902, and
they were much worse in 1801; and there is a great
deal of computation and guess-work in determining the
position of the median at any time through the century.
But it can be done within certain limits of accuracy where
the task of determining the arithmetic average would
be hopeless. When we have determined the median and trace
out the positions for 110 years we have a much more
interesting and exact piece of information than if we had
made use of the arithmetic average. We have the wage of
that man who is half way up the skilled wage earners; but if
we give the arithmetic average it will carry us no further;
it is simply a numerical quotient. The line in the diagram is
drawn through the estimated positions of the median for all
male adult wage earners in the United Kingdom, at selected
dates. These figures are rough, and should not be quoted
without verification. The only ones calculated are those with
a dot or cross in the figure; intermediate lines are interpolated.


[Diagram VI. Rough diagram illustrating the use of the median.]




MEASUREMENT OF GROUPS. 


SECOND LECTURE. 


The Standard Deviation and the Modulus. 

The methods I have employed so far for determining the 
median and the mode, together with the ordinary method of 
determining the arithmetic average, together also with the 
quartiles and deciles, give a series of definite quantities 
connected with the curve. Each of these quantities, the 
mode and the median for example, performs the function of an average; 
that is to say, that number by itself gives briefly one of the 
most important positions, one of the most important 
characteristics of the whole curve. But no one of these 
quantities gives sufficient information to enable us to 
reconstruct the curve or to describe it completely. It is 
true that if we have given the nine deciles, including the 
median, we have nine points on a continuous curve, and in 
general it is possible to construct it with reasonable accuracy. 
But if we only have the mode, or only the median, we 
have not enough to construct the curve. My object, then, 
is to develop one or more methods of calculating other 
quantities related to the group, which will enable us to 
complete or amend the description of the group, as given 
simply by one of the averages. 

We will always suppose that the group is described in 
relation to a horizontal axis OX, and may be of any nature 
about the axis. What we have found so far in the median or the 
mode is one point on that group, one position on that axis: in 
the case of the mode the position under the highest point; 
in the case of the median the position, the line through which 
divides the curve into two equal areas; in the case of the 
average it is the abscissa of the centre of gravity. I have 
now to find a second quantity which will enable us to describe 
or determine the shape of the curve when you are given this 
one position on it. The method I am going to take is 
independent of any assumed shape of the curve, and it is 

[applicable to both] of the groups to which I referred on page 2, 
the one which is supposed to be an accurate representation 
of the facts, and that which represents only samples of a 
larger group whose observation is not completely made. I 
have first to take the well-known method of calculating 
the deviations from the average, and then to pass on to find 
the average deviation, the average square of the deviation, 
and the average cube of deviation. Let there be $n$ observations 
represented by $x_1, x_2, \ldots, x_n$; let $\bar{x}$ be the abscissa of the 
centre of gravity; then $\bar{x}$ is the average of the group, the sum 
of the $x$'s divided by $n$, their number. From each of the $x$'s 
subtract the abscissa of the centre of gravity; thus $x_1 - \bar{x}$, 
$x_2 - \bar{x}$, $\ldots$, $x_n - \bar{x}$. Those are the deviations of the 
observations from their average. In some connections they are called 
the errors from the average, but I shall adopt the word 
"deviation" in every case. In the first place it is to be 
noticed that the sum of the deviations is necessarily zero; for 

$$\sum (x - \bar{x}) = \sum x - n\bar{x} = 0.$$

The sum of the squares of the deviations is $\sum (x - \bar{x})^2$, and 
$\frac{1}{n}\sum (x - \bar{x})^2$ is the mean square of the deviations, which is 
otherwise called the second moment of the deviations, about 
the origin in this case. The word moment is from a 
dynamical analogy; it is used in this connection by Professor 
Karl Pearson. 

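The following notation is adopted. The moments 
measured about the origin are written $\mu_1'$, $\mu_2'$, $\ldots$, about the 
centre of gravity $\mu_1$, $\mu_2$, $\ldots$, so that 

$$\mu_1' = \frac{1}{n}\sum x = \bar{x}, \qquad \mu_2' = \frac{1}{n}\sum x^2, \qquad \mu_3' = \frac{1}{n}\sum x^3,$$

and 

$$\mu_1 = \frac{1}{n}\sum (x - \bar{x}) = \frac{1}{n}\sum x - \bar{x} = 0, \qquad \mu_2 = \mu_2' - \bar{x}^2, \qquad \mu_3 = \mu_3' - 3\mu_2'\bar{x} + 2\bar{x}^3.$$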

The quantities we need are not the moments about an 
arbitrary origin, but the moments about the centre of gravity. 
But it is far easier to calculate the moments about an 
arbitrary origin, and then to obtain those about the centre of gravity 
by the above formulae, than to calculate the latter directly. 
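
The reduction from an arbitrary origin to the centre of gravity can be checked by expanding the powers of the deviation. The following is only a sketch of that expansion, written in the notation of the text ($\mu_k'$ about the origin, $\mu_k$ about the centre of gravity, $\bar{x} = \mu_1'$); it is offered as a verification, not as the lecturer's own working.

```latex
\begin{align*}
\mu_2 &= \frac{1}{n}\sum (x-\bar{x})^2
       = \frac{1}{n}\sum x^2 - 2\bar{x}\cdot\frac{1}{n}\sum x + \bar{x}^2
       = \mu_2' - 2\bar{x}^2 + \bar{x}^2
       = \mu_2' - \bar{x}^2, \\[4pt]
\mu_3 &= \frac{1}{n}\sum (x-\bar{x})^3
       = \mu_3' - 3\bar{x}\,\mu_2' + 3\bar{x}^2\mu_1' - \bar{x}^3
       = \mu_3' - 3\mu_2'\bar{x} + 2\bar{x}^3 .
\end{align*}
```

Since $\bar{x}^2$ cannot be negative, the first line also shows that the second moment is smallest when taken about the centre of gravity.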




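DIAGRAM VII. 
Heights of Persons. 
Observations and Curve of Error. 

 Inches          x    Instances        xy        x²y          x³y
 (x + 57½)               (y)

 57-58           0         1            0          0            0
 58-59           1         1            1          1            1
 59-60           2         6           12         24           48
 60-61           3         4           12         36          108
 61-62           4        15           60        240          960
 62-63           5        39          195        975        4,875
 63-64           6        74          444      2,664       15,984
 64-65           7       153        1,071      7,497       52,479
 65-66           8       243        1,944     15,552      124,416
 66-67           9       288        2,592     23,328      209,952
 67-68          10       321        3,210     32,100      321,000
 68-69          11       245        2,695     29,645      326,095
 69-70          12       213        2,556     30,672      368,064
 70-71          13       152        1,976     25,688      333,944
 71-72          14        88        1,232     17,248      241,472
 72-73          15        55          825     12,375      185,625
 73-74          16        26          416      6,656      106,496
 74-75          17         9          153      2,601       44,217
 75-76          18         1           18        324        5,832
 76-77          19         1           19        361        6,859

 Total                 1,935       19,431    207,987    2,348,427

Left compartment (57 to 67 inches): 824 instances, sum of xy = 6,331. 
Right compartment (67 to 77 inches): 1,111 instances, sum of xy = 13,100. 

$n = \sum y = 1,935$. 

$\sum xy = 19,431$. 

$\bar{x} = \frac{\sum xy}{n} = 10.0419$. Average 67.54 inches. 

$\sum x^2 y = 207,987$. 

$\frac{\sum x^2 y}{n} = 107.487 = \mu_2'$. 

$\frac{\sum x^3 y}{n} = 1,213.66 = \mu_3'$. 

Referred to average: 

$\mu_2 = \mu_2' - \bar{x}^2 = 6.647$ (2nd moment). 

$m_2 = \mu_2 - \frac{1}{12} = 6.564$ (2nd moment corrected). 

$\sigma$ (standard deviation) $= \sqrt{m_2} = 2.562$ inches. 

$c$ (modulus) $= \sqrt{2m_2} = 3.623$ inches. 

$\mu_3 = \mu_3' - 3\mu_2'\bar{x} + 2\bar{x}^3 = 1,213.66 - 3,238.12 + 2,025.24 = .78$ (3rd moment). 

$j = \frac{\mu_3}{c^3} = +.016$ (skewness from moments). 

$j = +.06$ (skewness from observations). 

$\eta$ (mean deviation from average) $= 2.01$ (inches). 

Median $=$ average $- \frac{1}{3}jc = 67.47$ (inches). 

Mode $=$ average $- jc = 67.32$ (inches). 

By parabolic interpolation: Mode, 67.303 inches. Median, 67.566 inches. 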





It will now be convenient to follow the figures on the 
table and diagram adjoining. The figures are taken from the 
report of the Anthropometrical Committee of the British 
Association. They are selected merely as being a 
convenient group by which to explain the calculation of these 
moments. The heights of 1,935 persons were given as between 
certain inches, between 57 and 58 inches and so on, as under the column 
headed $x + 57\frac{1}{2}$. I take the origin at 57½ inches, so that the 
abscissae for the successive groups are 0, 1, 2 .... 19, as the 
totals of the table require. The 
number of instances in these various groups are those given 
in the second column, under the letter $y$; one person under 
58 inches, one between 58 and 59 inches, and so on. The 
instances in this case occur in groups, and we are not able to 
separate them by means of the data, hence each deviation will 
occur in most cases more than once. Thus, a deviation shown 
between 64 and 65 inches occurs 153 times. Instead of adding 
the $x$'s simply to obtain the sum of the deviations we multiply each 
deviation by the number $y$, the number of times it occurs, and 
so obtain the third column $xy$, whose sum is 19,431. This sum 
of the deviations from the origin 
is to be divided by 1,935, the total number of deviations, to 
give the first moment about the origin, namely, 10.042, and this gives the 
position of the centre of gravity measured from the origin, 
57½ inches. The columns under $x^2y$ and $x^3y$ require no 
explanation. The totals 207,987 and 2,348,427 are divided 
by $n$, giving 107.5 and 1,213, $\mu_2'$ and $\mu_3'$ in the notation adopted. 
It now remains to reduce these moments about the origin to 
the moments about the centre of gravity, by means of the 
formulae given above. 
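
The whole computation can be followed in a few lines of code. This is a minimal sketch, assuming the abscissae run from 0 to 19 measured from 57½ inches (the only numbering that reproduces the printed totals 19,431, 207,987 and 2,348,427); the names `COUNTS`, `raw_moment` and `moments_about_average` are illustrative and do not come from the lectures.

```python
# Counts y for abscissae x = 0, 1, ..., 19 (inches above the origin 57.5),
# transcribed from the table of heights.
COUNTS = [1, 1, 6, 4, 15, 39, 74, 153, 243, 288,
          321, 245, 213, 152, 88, 55, 26, 9, 1, 1]
ORIGIN_INCHES = 57.5


def raw_moment(counts, power):
    """Moment about the arbitrary origin: sum(x**power * y) / n."""
    n = sum(counts)
    return sum(y * x ** power for x, y in enumerate(counts)) / n


def moments_about_average(counts):
    """Return (x_bar, mu2, mu3), reducing raw moments to the centre of gravity."""
    m1 = raw_moment(counts, 1)
    m2 = raw_moment(counts, 2)
    m3 = raw_moment(counts, 3)
    mu2 = m2 - m1 ** 2                      # second moment about the average
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3    # third moment about the average
    return m1, mu2, mu3


n = sum(COUNTS)
x_bar, mu2, mu3 = moments_about_average(COUNTS)
average_inches = ORIGIN_INCHES + x_bar

print("n =", n)
print("x_bar =", round(x_bar, 4), " average =", round(average_inches, 2), "inches")
print("mu2 =", round(mu2, 3), " mu3 =", round(mu3, 2))
```

Only integer sums are formed before the single division by n, which is the practical advantage of working from an arbitrary origin.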

The practical simplicity of evaluating the moments by this 
method arises from the fact that we are dealing in the $x$'s 
with a series of numbers ascending in uniform order, 0 to 19, 
and that the whole arithmetic computation is very simple and 
very easily checked, whereas if we proceed on the direct 
method of writing down the position of the centre of gravity, 
which will naturally not be an exact number, each of the 
deviations will introduce as many decimal places as are kept 
in our calculation; and the squaring and cubing will be 
very arduous, and we have no ready means of checking our 
results. It is therefore worth while to take the formula and 
choose our origin so as to give the least arithmetic work and 
obtain the second and third moments indirectly. There is a 
small correction to be made for the moments so calculated for 
the second, fourth, and other even moments. I will deal only 
with the second. It is to be observed that in the whole 
calculation it is assumed that all the persons in a particular 
group are exactly at the middle of that group, e.g., that 
153 persons in the 64 to 65 inches have the height exactly 
64½ inches. It is obvious that that will not be the case, and 
it is easily seen that that will introduce a definite error in the 
calculation of the second moment. For if we take one of 
these groups in particular, and make the assumption that the 
whole number lies at its middle point, we are representing it 
by a rectangle instead of by a trapezium with the side nearer 
the centre of the group longer than the other; a little 
consideration will show that that makes the second moment 
too great. Mr. Sheppard has shown that under certain 
circumstances it will be sufficient correction to subtract the 
fraction $\frac{1}{12}$ from the second moment calculated on the 
assumption that each group is concentrated at its middle point, 
to obtain a moment which is a true approximation. 
On page 21 the corrected moment, $m_2$, is 6.564, while the 
uncorrected moment, $\mu_2$, is 6.647. The correction will be 
$\frac{1}{12}$ only if the difference between successive groups is one 
unit of abscissa; if the difference was $h$, we should have 
to multiply by $h^2$; but for practical work it is best to take 
the unit as the distance between groups which we are dealing 
with, and hence the correction $\frac{1}{12}$ is in the form which is 
of practical use. 
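
The size of the correction can be illustrated numerically. The sketch below is not taken from the lectures: it assumes a hypothetical curve of error with standard deviation 2.5 inches, groups it exactly into one-inch cells, places each cell's mass at its middle point, and compares the resulting second moment with the true one. The function name `grouped_second_moment` is illustrative.

```python
from statistics import NormalDist


def grouped_second_moment(dist, width=1.0, cells_each_side=40):
    """Second moment about the average when every cell is put at its mid-point."""
    cells = []
    for k in range(-cells_each_side, cells_each_side):
        lower = dist.mean + k * width
        upper = lower + width
        mass = dist.cdf(upper) - dist.cdf(lower)   # exact share falling in the cell
        cells.append(((lower + upper) / 2, mass))
    total = sum(mass for _, mass in cells)
    centre = sum(mid * mass for mid, mass in cells) / total
    return sum(mass * (mid - centre) ** 2 for mid, mass in cells) / total


TRUE_SIGMA = 2.5                                    # hypothetical value, in inches
curve = NormalDist(mu=67.5, sigma=TRUE_SIGMA)

uncorrected = grouped_second_moment(curve, width=1.0)
corrected = uncorrected - 1.0 ** 2 / 12             # subtract h**2 / 12 with h = 1

print("true second moment      ", TRUE_SIGMA ** 2)
print("grouped, uncorrected    ", round(uncorrected, 6))
print("grouped, corrected      ", round(corrected, 6))
```

For a smooth curve of this kind the grouped moment exceeds the true one by almost exactly one-twelfth of the squared group width, which is the amount removed.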

The standard deviation is defined as the square root of 
the second moment about the centre of gravity. Professor 
Karl Pearson used $\sigma$ to denote it, and $\sigma$ is 2.562 inches in this 
case. It is sometimes more convenient to deal with the square 
root of twice the moment, which is called the modulus, and 
denoted by the letter $c$. Professor Edgeworth uses the modulus, 
whereas Professor Karl Pearson uses the deviation. We shall 
see the appropriateness of the modulus when we deal with the 
curve of error. The modulus for this group is 3.623 inches. 
It is a very remarkable fact that the modulus for the height 
of groups of men is almost universally very nearly 3.6 inches. 
Professor Edgeworth gives a list of 10 such groups in the 
Jubilee volume of the Journal of the Royal Statistical Society: 
the moduli are 3.6 (United Kingdom), 3.6 (England), 3.4 
(Scotland), 3.6, 3.7, 3.8 (United States), 3.7 (Belgium), 
3.7 (Italy). I merely call attention to that in passing, to 
indicate that the modulus is of real significance and 
not a mere mathematical calculation. 
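
As a brief check of the two definitions, the following sketch starts from the corrected second moment quoted on page 21 (6.564) and forms the standard deviation and the modulus from it. The variable names are illustrative only.

```python
import math

m2_corrected = 6.564                      # corrected second moment, page 21

sigma = math.sqrt(m2_corrected)           # standard deviation: root of the moment
modulus = math.sqrt(2 * m2_corrected)     # modulus: root of twice the moment

print("sigma   =", round(sigma, 3), "inches")
print("modulus =", round(modulus, 3), "inches")
print("modulus / sigma =", round(modulus / sigma, 4))
```

The two measures always stand in the fixed ratio of the square root of 2, so either may be recovered from the other.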

Average Deviation. 

For the next few minutes I propose to assume that the 
curve I am dealing with is symmetrical about its centre of 
gravity. The curve of heights which is sketched on page 21 
is in fact very nearly symmetrical. If the curve is actually 
symmetrical all the odd moments are easily seen to be zero, 
while the even moments are not. Then this quantity $\sigma$ or $c$, 
whichever we adopt, serves to measure the distance of the 
curve from its average, to use a clumsy phrase, or the 
dispersion about the average. Before discussing the 
appropriateness of this measurement I have to explain two 
simpler methods of measuring the same thing, one based on 
the first power of the deviations, and the other based on the 
distance between the quartiles. First for the average or mean 
deviation which, in the notation I am using, is called $\eta$. 
If we write down the deviations in the method just defined 
and add them up, we obtain zero; but if we treat all the 
deviations as positive and add up their absolute values we 
do not obtain zero. The calculation is as follows: Treat the 
negative deviations and the positive deviations separately. 
The sum of the negative deviations is 

$$\sum (10.0419 - x) = 10.0419 \times 824 - 6331,$$

from the figures to the left in the table on page 21. The 
sum of the positive deviations is 

$$\sum (x - 10.0419) = 13100 - 10.0419 \times 1111,$$

from the numbers in the right compartment. The average 
deviation is, therefore, 

$$\{13100 - 10.0419\,(1111 - 824) - 6331\} \div 1935 = 2.01 \text{ (inches)}.$$

Probable Error. 

The other simple method is based on the quartiles. 
Calculate the quartiles of this group by any of the methods 
already given, and you will find them to be approximately 
65.8 inches and 69.3 inches. Since the median as given on 
page 21 is 67.566, one quartile is 1.78 inches below the 
median, and the other 1.72 inches above it. Half the distance 
between the quartiles is called the probable error. It is a 
term which is so firmly in use that there is no hope of 
improving it, but it is one of the most erroneous terms in use 
in mathematics. Half the distance is 1.75 inches. If we take 
a person at random from this group and measure his height, 
it is as likely as not the height will be found to be between 
the quartiles, for the space contained between the ordinates 
at the quartiles is exactly half the whole curve, hence the 
phrase "probable error." 

If we were dealing with the special distribution determined 
by the equation to the curve of error (see p. 54), we should 
have the following relations: average deviation = modulus 
$\div \sqrt{\pi}$, and probable error = modulus $\times$ .4769. These relations 
are approximately true for this distribution of heights, for the 
values of the mean deviation and of the probable error found 
from these equations when 
the modulus is 3.623 inches are 2.04 and 1.73 inches 
respectively, while the numbers found above are 2.01 and 
1.75. 
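
Both relations can be verified from the curve of error itself. The sketch below assumes the usual connection between the modulus and the standard deviation (modulus = standard deviation × √2) and uses the standard-library `statistics.NormalDist` to locate the upper quartile; the function name is illustrative.

```python
import math
from statistics import NormalDist


def curve_of_error_relations(modulus):
    """Mean deviation and probable error implied by a curve of error."""
    sigma = modulus / math.sqrt(2)
    mean_deviation = modulus / math.sqrt(math.pi)
    # The probable error is the distance from the centre to the upper quartile.
    probable_error = NormalDist(mu=0.0, sigma=sigma).inv_cdf(0.75)
    return mean_deviation, probable_error


mean_dev, prob_err = curve_of_error_relations(3.623)
constant = prob_err / 3.623

print("mean deviation  =", round(mean_dev, 2), "inches")
print("probable error  =", round(prob_err, 2), "inches")
print("probable error / modulus =", round(constant, 4))
```

The ratio printed last is the constant .4769 of the text, recovered here from the quartile of the curve rather than assumed.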

These methods of describing groups are, however, 
applicable to groups which do not conform, even approximately, 
to the law of error. I shall now treat them without the 
assumption that they do conform. The probable error is the 
measure of dispersion which is most quickly calculated. We 
can write down the quartiles very rapidly, and take half their 
difference at once. But that only takes into account the 
positions of the two quartiles, and does not take into account 
the positions of the extremes, but only their size, and, depending 
as it does only on two quantities, is liable to a large 
amount of accidental error. The mean deviation on the other 
hand takes into account the position as well as the number of 
all the quantities, and is therefore less liable to accidental 
error, and also it does not take at all long to calculate with 
simple numbers. The modulus and the standard deviation, 
again, take into account every observation, but they give extra 
weight to those which are a great distance from the average. 
In some cases that is right; in others it is not. If we are 
basing arguments as to the group and the shape of the group 
on probability, then very likely it will be correct to give this 
extra weight to an object which is far from the average, for 
the farther from the average the less the probability, and in 
some cases the probability diminishes very rapidly as we move 
from the average. If we are not going to make assumptions 
about the shape of the curve, nor apply the principles of 
probability, I do not know that we shall find any justification 
for taking the square, rather than the mean, deviation. As a 
rule we may say that we pass appropriately from the 
probable error to the mean error, and from the mean error 
to the standard error as the curves with which we are dealing 
become more definite and perfectly continuous, and 
approximate more and more nearly to a curve with a definite 
algebraic equation. For very rough measurements which are 
not continuous and which are not to be corrected, the probable 
error, measured as half the distance between the quartiles, 
will very likely be the best measurement. As the curve 
attains a definite shape, and as we are able to treat the 
observations as more and more continuous, it will be well to 
take the mean error, and finally, if we have a perfect 
algebraic curve, then very likely it will be most correct to 
take the standard deviation. 

Measurement of Skewness. 

Now to pass on to unsymmetrical curves. We have 
obtained by one of the averages the position of the curve and 
by one of these measures of dispersion one measure of its 
shape. We shall now obtain the measure of its want of 
symmetry, or briefly, of its skewness. Most curves have some 
degree of skewness; but in some cases it is negligible. 

As an example of a curve with considerable skewness, we 
may take Diagram III, on p. 6. The curve is elongated to 
the right; the mode is to the left, the centre of gravity to the 
right of the median. This is the general order of these three 
averages. If a skew curve is formed by stretching a 
symmetrical curve to the right, the stretching shifts the centre 
of gravity, relatively to the median; or, from another point of 
view, if a curve is heaped up to the left and stretched to the 
right, experiment will show that the line through the median 
is to the right of the highest point. 

There are very many possible ways of measuring this 
skewness. One obvious measurement is simply the distance 
of the centre of gravity from the median. Another is to use 
the quartiles. Call the positions of the quartiles $Q_1$, $Q_2$, the 
position of the median, $O$, of the mode, $M$, and of the centre of 
gravity, $G$. In a symmetrical curve the distance $Q_1O$ is equal 
to the distance $OQ_2$, whereas in a skew curve it will not be. 
In a skew curve stretching to the right, the upper quartile 
to the right is further from the median than the lower 
quartile, and the difference between these two measures will 
form another means of estimating its skewness. The third 
method is to take the first power of the deviations, and compare 
the excess on one side the centre of gravity with the defect 
on the other. The fourth method is to take the third power 
of the deviations and consider its absolute magnitude. All 
these methods have their uses. I propose to deal with three 
of them. I will first take that which is arithmetically the 
simplest. The simplest measurement, that which you can 
calculate almost instantly, is the difference between the 
distances from the quartiles to the median. But that gives a 
concrete quantity; in the case before us so many inches; 
whereas it is convenient to measure the skewness as an 
absolute quantity, on a scale from +1 to -1; and we must 
therefore reduce this concrete quantity to an absolute 
quantity. The proper method of doing that is to divide it by 
the modulus, which is a concrete quantity, in this case so 
many inches. The one divided by the other gives an absolute 
measurement, which would serve to measure the skewness. 
But it is better to multiply that measurement by the constant 
3.29 (see p. 36, below) before using it, to bring it into 
conformity with the theory of probability; in the same sort of 
way as the multiplication of the second moment by 2 to get 
the modulus brings the standard deviation into conformity 
with the methods of probability. 

Another and yet simpler method of measuring almost 
exactly the same quantity, is to divide the difference between 
those two quantities by their sum, that is to say by twice the 
probable error; then if we multiply that by 3.14 (see p. 36, 
below), we shall obtain the same measurement very nearly 
as before. This method supplies a good rough measurement 
which is very rapidly calculated; we write down the median 
and the two quartiles, calculating them roughly or by one of 
the more complete methods given above, and at once write 
down the probable error; by this means the skewness of the 
group can be calculated in five minutes. But this measurement 
depends on the positions of three points only, which are 
subject to accidental errors, and the parts outside the quartiles 
have not much influence on the result. 
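
The quartile rule just described reduces to a single line of arithmetic. The sketch below is an illustration only: the function name and the sample figures (a hypothetical wage group with quartiles at 20 and 30 and median at 24) are assumptions, and the sign is taken as positive when the curve is stretched to the right.

```python
def quartile_skewness(lower_quartile, median, upper_quartile, constant=3.14):
    """Difference of the quartile distances divided by their sum, times 3.14."""
    q1 = median - lower_quartile      # distance of the lower quartile from the median
    q2 = upper_quartile - median      # distance of the upper quartile from the median
    return constant * (q2 - q1) / (q2 + q1)


# Hypothetical group stretched to the right: quartiles 20 and 30, median 24.
j_skew = quartile_skewness(20.0, 24.0, 30.0)

# A symmetrical group gives no skewness at all.
j_symmetric = quartile_skewness(20.0, 25.0, 30.0)

print("skew group:       j =", round(j_skew, 3))
print("symmetrical group: j =", round(j_symmetric, 3))
```

Because the sum of the two distances is twice the probable error, only the median and the two quartiles are needed.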

A measure, which is influenced by all the items, is obtained 
by taking the third moment about the centre of gravity; this 
in itself is a measure of the skewness, but it is not of the 
right kind, since it is a concrete quantity of the order of 
a length which has been cubed; to reduce it to an 
absolute quantity it must be divided by $c^3$. Calling this 
measure $j$, we have $j = \frac{\mu_3}{c^3}$, which in the group given on 
p. 21 is equal to +.016. 

In a curve which is nearly symmetrical and approximates 
to the curve of error, the distance between the arithmetic 
average and the median will be $\frac{1}{3}jc$, and the distance between 
the arithmetic average and the mode will be $jc$, and these 
relations supply a third method of estimating the skewness. 

[Diagram VIII. Daily Wages of Belgian Coalminers.] 

First estimate the modulus, and then calculate the position of 
either mode or median and the arithmetic average, divide the 
distance by $c$ or by $\frac{1}{3}c$, and we obtain $j$. But that is not an 
accurate method, if we use the mode, which cannot be 
precisely determined; while if we use the median, we are 
depending upon a single position. The formulae to be 
preferred are $j = \frac{q_2 - q_1}{q_2 + q_1} \times 3.14$ (where $q_1$ and $q_2$ stand for the 
distances of the lower and upper quartiles from the median), 
and $j = \frac{\mu_3}{c^3}$, the former 
perhaps when the curve is not approximately the curve of 
error. 
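
The figures quoted for the heights on page 21 allow these relations to be checked. This is a rough sketch using only the printed values (third moment .78, modulus 3.623, average 67.54, and the skewness +.06 "from observations"); the function names are illustrative and not the lecturer's.

```python
def moment_skewness(mu3, modulus):
    """Skewness as the third moment divided by the cube of the modulus."""
    return mu3 / modulus ** 3


def median_from_average(average, j, modulus):
    """Median lies one-third of j*c below the average in a nearly normal curve."""
    return average - j * modulus / 3


def mode_from_average(average, j, modulus):
    """Mode lies j*c below the average in a nearly normal curve."""
    return average - j * modulus


AVERAGE, MODULUS = 67.54, 3.623           # values printed on page 21

j_moments = moment_skewness(0.78, MODULUS)
median_est = median_from_average(AVERAGE, 0.06, MODULUS)
mode_est = mode_from_average(AVERAGE, 0.06, MODULUS)

print("j from moments =", round(j_moments, 3))
print("median =", round(median_est, 2), " mode =", round(mode_est, 2))
```

The estimates agree with the 67.47 and 67.32 inches of the table, which were evidently formed with the larger value of j.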

The adjoining Diagrams illustrate the practical use of the 
technical quantities which I have now discussed. In 1896 
the Belgian Government undertook an Industrial Census, and, 
amongst other things, they collected figures of the wages of 
most of the workpeople of Belgium. We have here in 
graphic form the daily wages of the Belgian coal miners in 
1896. A supplementary enquiry was conducted in 1900 over 
nearly the same area, and the result is given just below. The 
methods we have developed give us a rapid means of 
comparing the results of those two enquiries. It is the 
rectangular figures only with which we have to deal at 
present. The average increased from 3.68 francs to 
5.36 francs between the dates; the modulus from 1.20 to 
2.047 francs; the skewness changed from a negative skewness 
of -.10 to a positive one of .22. Those three statements 
rightly understood and interpreted give in a brief form the 
result of the Census. The average has increased, more money 
went in wages, and the modulus and standard deviation have 
increased very much. There was a development, therefore, 
of wages away from the average, either by highly skilled 
workers increasing their wages greatly, or by a body of 
unskilled workers coming into existence. If you look at the 
curve you will see the dispersion is chiefly increased to the 
right, and that increased standard deviation is due either to 
the inclusion of a higher grade of workmen than had been 
included before, or to the fact that the higher grades of work 
had obtained a great increase of wages. I am inclined to 
think it possible that the increase of dispersion is partly due 
to the erroneous inclusion of people in the second enquiry 
who were not included in the first, but I have no means of 
going behind the figures. The change of $j$ comes from the 
same sort of reason, that a body of skilled workmen were 
obtaining higher wages, or that the number of skilled 
workmen had increased. Either of these means would 
increase $j$ in a positive direction. This use of the letters 
may be left for consideration. 


Returning for a moment to the use of deviations in 
connection with the median and arithmetic average, I have 
to point out the curious relation between the two. The 
arithmetic average is that quantity from which the sum of 
the deviations is nothing, and the sum of the squares of the 
deviations the least possible. The second result is obtained 
instantly from the formula already given, $\mu_2 = \mu_2' - \bar{x}^2$. The 
sum of the squares of the deviations from the arithmetic 
average is $n\mu_2$, the sum of the squares from some other origin 
$n\mu_2'$; and from that formula $\mu_2$ is always less than $\mu_2'$. The 
median on the other hand makes the sum of the first powers 
of the deviations a minimum, and the sum of the zero powers 
zero. If we take the zero power of the deviations, each 
deviation is replaced simply by 1, taken with its proper sign, 
and then from the definition 
of the median we find the sum of the zero powers measured 
from the median is zero. That the sum of the first powers is 
a minimum can be readily demonstrated, most easily by an 
analogy. Suppose that it is required to run from a telephone 
exchange separate wires to every one of $n$ places in a straight 
line, where should the exchange be placed, so as to use the 
least total amount of wire? At the median position. For if 
you move from the median position to the right or to the left 
you will find immediately that you are adding more wire than 
you are subtracting. Supposing there are 20 stations, and 
you have a position between the 10th and 11th; if you move 
to a position between the 11th and 12th, you have to increase 
your distance from 10 stations and diminish it from 9, in every 
case by the same length of the wire. The wires correspond 
to the deviations; and the sum of lengths of the wires is the 
sum of the lengths of the deviations. Consideration of this 
illustration will show that the sum of the deviations is a 
minimum when they are measured from the median, but that 
the median is not quite determinate, for if there are an even 
number of stations the sums of the deviations measured from 
all points between the two central stations are the same. 
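
The telephone-exchange argument can be tried on a small example. The station positions below are hypothetical, chosen only for illustration; the sketch searches a fine grid of exchange positions and reports where the total wire, and where the sum of squared lengths, is least.

```python
from statistics import mean, median


def total_wire(stations, exchange):
    """Sum of the first powers of the deviations (absolute distances)."""
    return sum(abs(s - exchange) for s in stations)


def total_squares(stations, exchange):
    """Sum of the squares of the deviations."""
    return sum((s - exchange) ** 2 for s in stations)


STATIONS = [1, 2, 4, 7, 11, 16, 22]                 # hypothetical positions on a line
GRID = [i / 10 for i in range(0, 231)]              # candidate exchange positions

best_for_wire = min(GRID, key=lambda e: total_wire(STATIONS, e))
best_for_squares = min(GRID, key=lambda e: total_squares(STATIONS, e))

print("median =", median(STATIONS), " least wire at", best_for_wire)
print("average =", mean(STATIONS), " least squares at", best_for_squares)

# With an even number of stations every point between the two central
# stations needs the same total wire.
EVEN = [1, 2, 4, 7, 11, 16]
print([total_wire(EVEN, e) for e in (4, 5.5, 7)])
```

The first search stops at the median and the second at the arithmetic average, while the last line shows the indeterminacy of the median for an even number of stations.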



MEASUREMENT OF GROUPS. 


THIRD LECTURE. 


The Curve of Error. 

The subject discussed in this section is full of technical 
difficulties, and it will be impossible to cover the subject 
adequately in the short space allotted to it. It must then be 
regarded as containing rather a summary of those important 
points connected with the theory of error, which I shall have 
to use subsequently. While making it as complete as possible 
in itself, in several cases I shall have to ask acceptance 
without proof of results which I shall find it necessary to use 
at a future date. 

Among the various shapes assumed by groups of observations 
of any kind which are (as in the groups already taken) 
grouped in a more or less regular way about the central line, 
there is one distribution of the various deviations about 
their centre which is regarded as normal, and the curve 
representing it is called the curve of error. And it is the 
deduction of the equation of that distribution which I have 
first to deal with. After we have the equation we will discuss 
to what extent the normal curve is actually found in the kind 
of statistics with which we deal. The normal curve can be 
obtained from the statistics found in games of chance, or 
from the statistics which may be obtained by counting the 
occurrence of specified digits in mathematical tables, or from 
anthropometric measurements, or again from some groups of 
social statistics and from some groups of vital statistics. The 
deduction of the equation I am going to take is the only one 
which I think lends itself to purely algebraic treatment. 
Other deductions depend upon the use of differential calculus 
or even of the theory of functions. 

Let us consider some occurrence for which the chance 
is $p$, the chance against $q$, so that $p + q = 1$. Let us suppose 
that an experiment which may or may not give the occurrence takes 
place again and again, and that in each $n$ times we 
count how often success is obtained. For instance, suppose 
we pitch a coin $n$ times and count how many heads are found 
in the $n$. Repeat the experiment again and again and 
count in each case the number of heads, that would give a 
series of the kind I have in mind. For a small number of 
experiments, if each set of experiments contained 10 tries or 
any small finite number, it is easy to set down the probabilities 
of the various numbers of successes. And it is also clear as 
soon as the algebra of the method is tackled, that there is a 
limit towards which these chances tend as the number of 
experiments in each group is indefinitely increased. What we 
have to do first is to find the limit towards which such a series 
of experiments tends when the number is increased indefinitely. 



The diagram annexed represents the various chances of 
the numbers of heads in the experiments of pitching a coin 
12 times. The most probable number of heads is of course six, 
the least probable none, or 12, and the probability of 0, 1, 2, 
up to six, is continually increasing. If we erect 13 ordinates 
representing the probability of no heads, one head, and so on 
up to 12 heads, we get the diagram marked $(\frac{1}{2} + \frac{1}{2})^{12}$. If we 
take another kind of experiment where the chances for success 
and failure are not equal, e.g., where the chance of success 
is .3, and perform the experiment 10 times, we get the 
probabilities of one, two, and so on up to 10 successes 
represented by the following diagram: 

[Diagram: $(.3 + .7)^{10}$] 

The first curve is of course symmetrical, the second curve 
unsymmetrical. What we have to do is to deduce the 
shape of the curve when the index is infinite, whether the 
chance in favour is one-half, or whether the chances for and 
against are unequal. 
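
The ordinates of the two diagrams are simply the terms of the binomial expansions, and may be computed directly. The sketch below uses the standard-library `math.comb`; the function name `binomial_terms` is illustrative.

```python
from math import comb


def binomial_terms(p, n):
    """Chances of 0, 1, ..., n successes in n trials with chance p of success."""
    q = 1 - p
    return [comb(n, m) * p ** m * q ** (n - m) for m in range(n + 1)]


heads = binomial_terms(0.5, 12)     # the 13 ordinates for pitching a coin 12 times
uneven = binomial_terms(0.3, 10)    # chance of success .3, ten trials

for m, chance in enumerate(heads):
    print(m, round(chance, 4))

print("most probable number of heads:", heads.index(max(heads)))
print("most probable number of successes:", uneven.index(max(uneven)))
```

The first set of ordinates is symmetrical about six heads; the second has its greatest ordinate at three successes and a longer tail to the right.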

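If $p$ is the probability of an event, and $q = 1 - p$, then the 
probability of $m$ successes in $n$ trials is 

$$\frac{n!}{m!\,(n-m)!}\, p^m q^{n-m},$$

and successive values of $m$ give the terms of the binomial 
expansion $(p + q)^n$. 

Assume that $np$ is integral. Let $np = r$, $nq = s$, $r + s = n$. 
Denote successive terms by $u_0, u_1, \ldots, u_n$. 

Then $u_s$, which is the greatest term, $= \dfrac{n!}{r!\,s!}\, p^r q^s$, and 

$$u_{s+x} = u_s \times \frac{\left(1 - \frac{1}{r}\right)\left(1 - \frac{2}{r}\right) \ldots \text{ to } x-1 \text{ factors}}{\left(1 + \frac{1}{s}\right)\left(1 + \frac{2}{s}\right) \ldots \text{ to } x \text{ factors}}.$$

$$\log u_{s+x} = \log u_s + \log\left(1 - \frac{1}{r}\right) + \log\left(1 - \frac{2}{r}\right) + \ldots + \log\left(1 - \frac{x-1}{r}\right) - \log\left(1 + \frac{1}{s}\right) - \log\left(1 + \frac{2}{s}\right) - \ldots - \log\left(1 + \frac{x}{s}\right)$$

$$= \log u_s - \frac{1 + 2 + \ldots + (x-1)}{r} - \frac{1^2 + 2^2 + \ldots + (x-1)^2}{2r^2} - \text{\&c.} - \frac{1 + 2 + \ldots + x}{s} + \frac{1^2 + 2^2 + \ldots + x^2}{2s^2} - \text{\&c.}$$

$$= \log u_s - \frac{x(x-1)}{2r} - \frac{(x-1)x(2x-1)}{12r^2} - \frac{x(x+1)}{2s} + \frac{(x+1)x(2x+1)}{12s^2} + \text{\&c.}$$

$$= \log u_s - \frac{x^2(r+s)}{2rs} + \frac{x(s-r)}{2rs} + \frac{x^3(r^2 - s^2)}{6r^2s^2} + \text{\&c.}$$

$$= \log u_s - \frac{x^2}{2pqn} + \frac{x(q-p)}{2pqn} + \frac{x^3(p^2 - q^2)}{6p^2q^2n^2} + \text{\&c.}$$

Let $x^2 = z^2 \times 2pqn = z^2c^2$. Then 

$$\log u_{s+x} = \log u_s - z^2 + \frac{z(q-p)}{\sqrt{2pqn}} - \frac{2z^3(q-p)}{3\sqrt{2pqn}} + \text{\&c.}$$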


The assumption that $np$ is integral made above does not 
affect the limiting form of the equation. 

It is at this point necessary to consider which terms are to 
be rejected, when $n$ is made infinite. If $x$ is finite, if we move 
through only a finite number of terms from the greatest 
ordinate, the ordinate $u_{s+x}$ equals the ordinate $u_s$. This part 
of the curve approximates to a horizontal straight line. To 
take a numerical instance, the chance of obtaining 499 heads 
in 1,000 tosses is practically equal to that of obtaining 500 
heads. On the other hand if $x$ is infinite, it appears that 
$u_{s+x}$ is zero. If the figure is drawn so as to show finite values 
of $x$ we obtain a horizontal straight line; but if an attempt is 
made to include infinite values of $x$, the curve becomes the axis 
of $x$ and a finite vertical line through the origin. 

But it becomes clear, if we examine the shape for different finite values of n, that the curve has a definite shape and finite curvature near the centre. Before we go further let us take an analogy. If we take an hyperbola and try to include the whole curve in our figure, the curve will coincide with its asymptotes. In order to draw the curve so that the part between the asymptotes and the vertex can be seen, we must adopt a particular scale so as to obtain the length from the vertex to the centre as a finite quantity. Again, if we pass from the ellipse to the parabola by the process of pushing the centre to infinity we have, in order to obtain the finite part of the parabola at all, to make the hypothesis that $y^2/x$ is finite. In order to get the finite part of the curve of error we shall have to select that part where the ratio of $x^2$ to n is finite. Then it will be found that we shall obtain the part of the curve that has a definite curvature and a definite shape in a finite form. Let us assume, then, that $x^2/n$ is finite; and let us substitute for $x^2/n$ the quantity $z^2$ with the factor 2pq. The reason for that factor will soon be obvious. Take $c^2 = 2pqn$, so that $x = zc$. We then obtain the equation $\log u_{s+x} = \log u_s - z^2$, when all vanishing terms are neglected. If the above deduction is carefully examined it will be found that all the terms omitted are infinitesimal in comparison with those retained, when n is infinite.
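The passage to the limit can be checked by direct arithmetic. The following is only a sketch in Python, not part of the lecture: the function names are illustrative, and the case n = 1,000, p = q = 1/2 is an assumed example. It compares the exact ratio of a term to the greatest term with the limiting form $e^{-x^2/c^2}$, where $c^2 = 2pqn$.

```python
import math

def log_binomial_term(n, p, k):
    """Logarithm of the chance of exactly k events of chance p in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def exact_ratio(n, p, x):
    """Ratio of the term x places from the greatest term to the greatest term.

    Events of chance p are counted, so the greatest term stands at np
    (assumed integral, as in the text).
    """
    greatest = round(n * p)
    return math.exp(log_binomial_term(n, p, greatest + x)
                    - log_binomial_term(n, p, greatest))

def limiting_ratio(n, p, x):
    """exp(-x^2 / c^2) with c^2 = 2pqn, the form retained when n is infinite."""
    c_squared = 2 * p * (1 - p) * n
    return math.exp(-x * x / c_squared)

if __name__ == "__main__":
    n, p = 1000, 0.5
    for x in (1, 5, 10, 20, 40):
        print(x, round(exact_ratio(n, p, x), 4), round(limiting_ratio(n, p, x), 4))
```

For a finite x such as 1 the ratio is nearly unity, which is the horizontal straight line spoken of above; it is only when x is of the order of c that the curvature appears.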

Removing logarithms, and writing y for $u_{s+x}$, we have

$$y = u_s\,e^{-\frac{x^2}{c^2}}$$

We are still at liberty to choose a scale for the ordinates, and it is most convenient to choose that which makes the greatest ordinate $= \dfrac{1}{c\sqrt{\pi}}$, for then the area bounded by the curve and the axis of x becomes unity; then each part of the area represents the probability of certain occurrences, for the whole curve represents 1, which stands for certainty. An alternative is to take the ordinate as $\dfrac{N}{c\sqrt{\pi}}$, so that the area of the curve is N, where N is the number of experiments. Then the area standing on any part of the axis represents the most probable number of events corresponding to that part.

Now let us go back and take the terms we have so far rejected. Each of these contains the factor $\dfrac{p-q}{\sqrt{2pqn}}$. It is convenient to call that quantity 2j, for then we shall find that j has the meaning already assigned to it (see p. 21, and for proof see p. 36).

Re-writing the equation with that notation, and then expanding the part which contains j and neglecting the higher powers of j, we have

$$y = \frac{1}{c\sqrt{\pi}}\,e^{-\frac{x^2}{c^2}}\left\{1 - 2j\left(\frac{x}{c} - \frac{2x^3}{3c^3}\right)\right\}$$

It is easily seen that $jc = \frac{1}{2}(p-q) = p - \frac{1}{2}$. The centre of gravity of that curve can be shown to be at the origin by integration. The area of the curve is of course the integral of y.dx, taken between plus infinity and minus infinity. The part of the integral which does not contain j is a well-known definite integral, which equals unity. It can be seen that the part containing only odd powers of x does not affect the definite integral. Hence the area is unity.

Now let us calculate the error of mean square of the curve from the equation. It is obtained by multiplying the element of area y.dx by the square of its distance (x) from the centre of gravity, adding up all the parts so obtained, and then dividing by the whole area, i.e., unity.

It is easily seen that the j term does not enter into the result, which is therefore $\int_{-\infty}^{\infty} y\,x^2\,dx = \frac{1}{2}c^2$, by integration by parts. Comparing this with p. 21, we see that c, thus calculated, is the modulus, as there defined. The third moment is $\int_{-\infty}^{\infty} y\,x^3\,dx$ divided by the area, which is unity; integrating by parts we obtain that the skewness, as defined on p. 28, is equal to the j in this equation. The constants in the equation to the curve of error, as written above, are then the modulus and skewness as defined for curves in general. The average deviation, as defined on p. 24, is found by integrating $\int y\,x\,dx$, every deviation being reckoned as positive, to be $\dfrac{c}{\sqrt{\pi}}$, and does not involve j.
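These statements about the area, the centre of gravity, the mean square, the third moment and the average deviation can be verified by numerical integration. The sketch below is only an illustration in Python: it assumes the skew curve has the form $y = \frac{1}{c\sqrt{\pi}} e^{-x^2/c^2}\{1 - 2j(x/c - 2x^3/3c^3)\}$, and the values c = 3.623 and j = .06 are borrowed from the height example merely as convenient figures. The function names are hypothetical.

```python
import math

def skew_curve(x, c, j):
    """Ordinate of the (assumed) asymmetrical curve of error."""
    z = x / c
    return (math.exp(-z * z) / (c * math.sqrt(math.pi))
            * (1 - 2 * j * (z - 2 * z ** 3 / 3)))

def moment(power, c, j, absolute=False, half_width=10.0, steps=20000):
    """Simpson's rule for the integral of y * x**power over +-half_width moduli.

    With absolute=True every deviation is reckoned as positive.
    `steps` must be even.
    """
    a = -half_width * c
    h = 2 * half_width * c / steps
    total = 0.0
    for i in range(steps + 1):
        x = a + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        distance = abs(x) if absolute else x
        total += weight * skew_curve(x, c, j) * distance ** power
    return total * h / 3

if __name__ == "__main__":
    c, j = 3.623, 0.06
    print("area             ", moment(0, c, j))
    print("centre of gravity", moment(1, c, j))
    print("mean square      ", moment(2, c, j), "against", c * c / 2)
    print("third moment     ", moment(3, c, j), "against", j * c ** 3)
    print("average deviation", moment(1, c, j, absolute=True),
          "against", c / math.sqrt(math.pi))
```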

The equation in its integral form is, if Y stands for area on abscissa from origin to x,

$$Y = \frac{1}{\sqrt{\pi}}\int_0^{\tau} e^{-t^2}\,dt \mp \frac{j}{3\sqrt{\pi}}\left\{1 - \left(1 - 2\tau^2\right)e^{-\tau^2}\right\}, \quad \text{where } \tau = \frac{x}{c} \text{ taken positively,}$$

the lower sign being taken when x is negative.

This does not admit of any simple evaluation, but it has been tabulated for a wide range of values of $x/c$.* From these tables it is found that the probable error for the symmetrical curve (where j is zero) is c × .4769, which is written pc. For the unsymmetrical curve the distances between the median and the quartiles can be shown to be pc increased or diminished by a small quantity proportional to jc, while the distance between the centre of gravity and mode is jc,† and between the centre of gravity and the median is $\frac{1}{3}jc$, as used on pp. 27-29 above, where the resulting numerical values are given.
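Where printed tables are not at hand the integral can be computed directly. The following Python sketch is an illustration only: it assumes the integral form $\frac{1}{2}\operatorname{erf}(\tau) \mp \frac{j}{3\sqrt{\pi}}\{1 - (1 - 2\tau^2)e^{-\tau^2}\}$, which is a reconstruction from the equation of the curve and not a quotation, and it finds the multiplier .4769 of the probable error by repeated bisection. The function names are hypothetical.

```python
import math

def symmetric_area(tau):
    """Area of the symmetrical curve between the origin and tau moduli."""
    return 0.5 * math.erf(abs(tau))

def skew_area(tau, j):
    """Area between the origin and tau moduli when the j term is included.

    The correction is subtracted to the right of the origin and added to
    the left of it (assumed form, see the lead-in).
    """
    correction = (j / (3 * math.sqrt(math.pi))
                  * (1 - (1 - 2 * tau * tau) * math.exp(-tau * tau)))
    sign = 1 if tau >= 0 else -1
    return symmetric_area(tau) - sign * correction

def probable_error(tolerance=1e-12):
    """Multiple p of the modulus within which half the symmetrical curve lies."""
    low, high = 0.0, 1.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if symmetric_area(mid) < 0.25:   # a quarter of the area on each side
            low = mid
        else:
            high = mid
    return (low + high) / 2

if __name__ == "__main__":
    print("p =", round(probable_error(), 4))
    print("area to +1.387 moduli with j = .073:", round(skew_area(1.387, 0.073), 3))
```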

The effect on the curve of the j term is to stretch the curve to the right, heaping it on the left at the same time, the sort of figure which is indicated in the second diagram on p. 32. Actual examples of the curve for different values of c and j are given on pp. 21, 28.

The tables give the integral for the argument $x/c$, not for x, and before they can be used the observations must be reduced to the centre of gravity as origin and c as unit. Then if we find in the table that the integral function, Y, is .455 when the argument = +1.387,* we are to understand that .455 of the whole area stands on the axis of x between 0 and 1.387 of the modulus. The tabular statement then shows the various fractions of the whole observations which may be expected (in an infinite number of experiments) to lie between the most probable value and various values with an assigned deviation from the centre. Thus with the symmetrical curve of error, one-quarter of the observations may be expected to be above the most probable value by not more than .47 of the modulus, one-third by not more than .68 of the modulus; all but 2 per 1000 are separated by less than 2.2 of the modulus from the most probable value; the chance of a deviation of 5 times the modulus is less than 1 in a billion.

* See Burgess's Mathematical Tables; Merriman's Least Squares, p. 186; Bowley's Elements of Statistics, p. 281, and p. 332 (2nd Edition); and Journal of the Royal Statistical Society.

† See Elements of Statistics, p. 331. Hence, $OQ_3 - OQ_1 =$ … which gives results on p. 27 and p. 29.
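The fractions quoted for the symmetrical curve can be reproduced from the error function. This is a small Python sketch offered as a check and not as the method of the printed tables; the two function names are illustrative.

```python
import math

def fraction_within(k):
    """Fraction of the observations lying on one side of the most probable
    value and not more than k moduli from it (symmetrical curve)."""
    return 0.5 * math.erf(k)

def fraction_beyond(k):
    """Fraction separated from the most probable value by more than k moduli,
    counting both sides together."""
    return math.erfc(k)

if __name__ == "__main__":
    for k in (0.47, 0.68):
        print(k, "moduli:", fraction_within(k))
    for k in (2.2, 5.0):
        print("beyond", k, "moduli:", fraction_beyond(k))
```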

* Which is the case when j = +.073.

Supposing we are given a set of observations which we have reason to suppose should arise from the distribution defined by the symmetrical curve of error, what particular curve of error are we to fit to our observations? The problem is not very important in itself, but the method of solution is very similar to the method which underlies the principle of least squares and of several other formulæ. The only things which we have a possibility of choosing are the abscissa of the centre of gravity and the modulus.

Let $x_1, x_2, \ldots, x_n$ be the deviations of the observations measured from their average. The separate chances that these should arise if the equation of distribution is

$$y = \frac{1}{c\sqrt{\pi}}\,e^{-\frac{(x-h)^2}{c^2}} \quad \text{are} \quad y_r = \frac{1}{c\sqrt{\pi}}\,e^{-\frac{(x_r-h)^2}{c^2}}$$

where r is given successive values 1, 2 . . . n.

The chance that they should occur together in a given group is, by multiplication,

$$c^{-n}\,\pi^{-\frac{n}{2}}\,e^{-\frac{\Sigma(x_r-h)^2}{c^2}} = P \text{ (say)}.$$

Now on what principle are we to find out the values of c and h? Of all the curves of error from which these observations may be supposed to have arisen there is one curve from which they would arise with the least improbability; to find this we have to make P a maximum. h and c are quite independent. Then the differentials of P with regard to h and c must each be zero. The first gives that h is zero* and the centre of gravity of the observations is the origin. The second shows that $\dfrac{c}{\sqrt{2}}$ is the root of the mean square of the x's.† So that to choose the normal curve which fits the observations best, in the sense that they would have arisen from that distribution with the least improbability, we must take for the centre of the curve the centre of gravity of the observations, and for the modulus the error of mean square multiplied by $\sqrt{2}$.
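The rule just reached is short enough to be put in a few lines. The sketch below, in Python, is an illustration under the assumptions of the text (a symmetrical curve of error, centre and modulus the only things to be chosen); the function names and the sample figures are hypothetical. It also evaluates log P, so that it can be seen that any other centre or modulus gives a smaller chance.

```python
import math

def fit_symmetrical_curve(observations):
    """Return (centre, modulus) of the curve of error from which the
    observations would arise with the least improbability."""
    n = len(observations)
    centre = sum(observations) / n                      # centre of gravity
    mean_square = sum((v - centre) ** 2 for v in observations) / n
    modulus = math.sqrt(2 * mean_square)                # error of mean square times sqrt(2)
    return centre, modulus

def log_chance(observations, centre, modulus):
    """Logarithm of P, the chance that the observations occur together."""
    n = len(observations)
    index = sum((v - centre) ** 2 for v in observations) / modulus ** 2
    return -n * math.log(modulus) - 0.5 * n * math.log(math.pi) - index

if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5]            # illustrative figures only
    centre, modulus = fit_symmetrical_curve(sample)
    print(centre, modulus)
    print(log_chance(sample, centre, modulus))
```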


It will be noticed in the proof that in a sense there is only one symmetrical curve of error. We can reduce any curve to the form $y = \dfrac{1}{\sqrt{\pi}}e^{-x^2}$ by suitable choice of scales for the co-ordinates; but if we are taking two groups measured in the same unit, for instance, both in inches, or shillings, or years, then the x axis has concrete units, the unit distance stands at one inch, one shilling, one year. And if we take two separate curves both measured in inches, work with the same unit of abscissa, and make the areas each unity, we do not get the same maximum ordinate. The finite part of the curve with the lower maximum ordinate stretches further to the right and left than the corresponding part of the other. As long as we deal with concrete quantities we shall find that the quantity c enters into the shape of the curve; and the comparison of any two curves is made by means of the values of c given in terms of the unit of abscissa. The quantity j is independent of all concrete quantities, and is an absolute measure of skewness, as already pointed out.


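* $\dfrac{dP}{dh} = \dfrac{2\Sigma(x_r-h)}{c^2}\cdot P = -\dfrac{2nh}{c^2}\cdot P = 0$ when h is 0.

† $\dfrac{dP}{dc} = \left(-\dfrac{n}{c} + \dfrac{2\Sigma x_r^2}{c^3}\right)\cdot P = 0$, when $c^2 = 2\Sigma x_r^2 \div n$, since h is 0.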




Unit 3.623 inches. The numbers calculated, actual, and their differences are per 1,000,
and stand for the interval ending at the height named; the first line includes all below 59 inches.

Inches      τ      F(τ)    Y(τ)   Calculated   Actual   Difference   Difference from
                                                                       normal curve
  59     -2.357    .500    .511        0          1        + 1           + 1
  60     -2.081    .498    .511        0          3        + 3           + 1
  61     -1.805    .495    .509        2          2          0           - 1
  62     -1.529    .484    .499       10          8        - 2           - 3
  63     -1.253    .462    .477       22         20        - 2           - 2
  64     - .977    .416    .431       46         39        - 7           - 7
  65     - .700    .339    .350       81         79        - 2           + 2
  66     - .424    .226    .231      119        125        + 6           +12
  67     - .149    .084    .084      147        149        + 2           + 7
  68     + .127    .071    .071      155        166        +11           +11
  69     + .403    .215    .209      138        127        -11           -17
  70     + .679    .332    .322      113        110        - 3           - 7
  71     + .955    .411    .395       73         79        + 6             0
  72     +1.231    .459    .442       47         47          0           - 1
  73     +1.507    .483    .467       25         28        + 3           + 4
  74     +1.783    .494    .479       12         13        + 1           + 2
  75     +2.059    .498    .485        6          5        - 1           + 1
  76     +2.335    .500    .488        3          1        - 2           - 1
  77     +2.611    .500    .489        1          0        - 1             0

Sum of differences, signs neglected                           64            80



We will now use the height-statistics given on p. 21 as an example of the method of comparing a set of observations with the curve of error. In the first place we take the centre of gravity as the origin, namely, 67.54 inches. The modulus, by the method of moments, is 3.623 inches, which is therefore to be taken as the unit. Thus 59 inches is 8.542 inches below the average, that is, 2.357 times the modulus. The latter number is entered under τ in the second column. All the others are calculated in the same way. Then turning to the tables and finding what integral corresponds to the assigned values of τ, in the symmetrical curve of error, we write them under the heading F(τ). So we have that between the average and 59 inches .5 of the whole curve is obtained, that is to say, one-half; in the next line, between average and 60 inches .498 of the curve is obtained, and so on all the way down to between the average and 76 inches, when again half the curve is obtained, correct to the third decimal place. We should not get the true half till we have gone to infinity, but the area of the curve beyond does not amount to one per mille of the whole. In this curve, for example, .462 is the probability that the height of a person chosen at random lies between 67.54 inches and 63 inches, for .462 is opposite 63 inches. The fraction of the curve is the same as the probability of the occurrence between the point given and the average.

The next column, called Y(τ), is obtained in a similar way from tables including the term involving j; the value of j is taken to be +.06 for reasons given below. The column following, under "calculated", consists of the differences of the Y(τ) column multiplied by 1,000; the numbers so obtained are the numbers to be expected approximately between 59 and 60 inches, 60 and 61 inches, &c. The following column, "actual", gives the actual occurrences per 1,000 in the same limits. The following column gives the differences in the various groups between the calculated and actual numbers. The greatest divergence is near the centre, where there are 12 more than were calculated. In the last column are given the differences if I had taken the normal curve instead of the skew curve. It is seen that by taking the curve as a skew curve the sum of these differences is diminished from 80 per 1,000 to 64 per 1,000.
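The construction of the columns can be followed in a short computation. The Python sketch below is an illustration and not the lecturer's own working: the centre, the modulus and the actual numbers per 1,000 are taken from the table, while the cumulative form of the skew curve, $\frac{1}{2} + \frac{1}{2}\operatorname{erf}(\tau) + \frac{j}{3\sqrt{\pi}}(1 - 2\tau^2)e^{-\tau^2}$, is an assumption obtained by integrating the equation of the curve. The names are hypothetical.

```python
import math

CENTRE = 67.54                      # centre of gravity, inches (from the lecture)
MODULUS = 3.623                     # modulus, inches (from the lecture)
BOUNDARIES = list(range(59, 78))    # 59, 60, ... 77 inches
# Actual numbers per 1,000: below 59, then 59-60, 60-61, ... 76-77 inches.
ACTUAL = [1, 3, 2, 8, 20, 39, 79, 125, 149, 166,
          127, 110, 79, 47, 28, 13, 5, 1, 0]

def tau(inches):
    """Deviation from the centre of gravity, with the modulus as unit."""
    return (inches - CENTRE) / MODULUS

def cumulative(t, j):
    """Fraction of the curve below t moduli (assumed form; j = 0 is the normal curve)."""
    correction = j / (3 * math.sqrt(math.pi)) * (1 - 2 * t * t) * math.exp(-t * t)
    return 0.5 + 0.5 * math.erf(t) + correction

def calculated_per_thousand(j):
    """Numbers per 1,000 expected below 59 inches and in each inch up to 77."""
    edges = [cumulative(tau(b), j) for b in BOUNDARIES]
    lower_edges = [0.0] + edges[:-1]
    return [1000 * (upper - lower) for lower, upper in zip(lower_edges, edges)]

def misfit(j):
    """Sum of the differences between calculated and actual, signs neglected."""
    return sum(abs(a - c) for a, c in zip(ACTUAL, calculated_per_thousand(j)))

if __name__ == "__main__":
    for j in (0.0, 0.016, 0.06):
        print("j =", j, " misfit per 1,000 =", round(misfit(j), 1))
```

The totals so found need not agree exactly with 80 and 64, since the lecture works from tables to three places of decimals and the computation does not round.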

I have now a rather difficult point to take with reference to one of those columns. Theoretically, j is calculated by the method of moments, the error of mean cube; but in practice that does not give good results. A single observation a long way from the average has a very great effect on the mean cube. So that if in this number of 1,985 persons we had included two persons from a nationality where stature was very low, or where it was very high, we should have instances at a long way along the group which would not properly vitiate the comparison of the curve of error, but would have a very unfortunate effect upon the mean cube. Instead of having a homogeneous group, we should have a group of 1,983 people from one group and 2 persons from another group which would not belong to the same curve. There has been a great deal of discussion as to what should be done with such abnormal cases. A good way out of the difficulty is not to calculate j by the above method at all, but to calculate it by an a posteriori method, to choose that value of j which makes the misfit least. We have already chosen c so as to make the improbability least. Let us choose j by some similar test. The method I have adopted here is due partly to Professor Karl Pearson, and partly to Professor Edgeworth. It is to obtain figures (not given here) in such a form that it can be seen what value of j will make the sum of the absolute differences least. The value which satisfies this condition is found to be j = .06.* The value obtained from the moments method is .016. This might have been used and would have given a result slightly better than the value j = 0. But I am inclined to say it is better to calculate j from the a posteriori method; I think it is quite as logical, and you are bound to get a better fit.

Professor Karl Pearson has given a test by which you can consider the following problem: Supposing you had a population with certain characteristics, such as height, distributed according to a curve with a particular formula, required the probability that an assigned distribution would be obtained from the supposed distribution. Putting it into a more concrete way, suppose the equation of the height group for the whole population was this equation with c = 3.623 inches, and j = .06: required the probability that 1,935 persons taken at random from the population would have the heights actually registered. Professor Karl Pearson has given a table† with the necessary figures for determining that probability. Calculation from his table on this distribution shows that if we take the symmetrical curve the probability of obtaining such a selection is .4; that is to say, the chances are two in five that the 1,935 persons would not be further from the supposed distribution than they actually are. If we take the skew curve with j = .06, the probability is .7; that is to say, the odds are seven to three that we should obtain persons as nearly conforming to this group as we have found. It is very difficult to argue back from the height of a person to the expression $(p+q)^n$, and I shall not at present attempt it. I have shown above that we should obtain this formula of the curve of error if we were dealing with chances, with events whose occurrence was given by those terms in the binomial theorem. But the same equation will be obtained on very many other suppositions, and I have only taken the simplest. Before giving these, however, it is necessary to define a frequency curve.

* See Journal of the Royal Statistical Society, June 1902, pp. 337-8.

† See London, Edin. and Dublin Phil. Mag., July 1900, p. 175.

If we are dealing with a group of measurements which are distributed about their average so that the number of them which lie at any defined distance from their average, say between z and (z + dz) in excess of it, can be represented by a definite function, say f(z), of that distance, then the curve which represents this function, i.e., y = f(z), is the frequency curve of that group. If the unit of ordinate is so chosen that the whole area contained between the curve, the ordinates at its extremities, and the axis of z, is unity, then $\int_a^b y\,dz = 1$ if a and b are the limiting values of z; in many cases a and b are $\pm\infty$. Then if the quantity is selected at random from the group, the probability that it will lie between $z_1$ and $z_2$ is $\int_{z_1}^{z_2} y\,dz$; the probability that it will lie between z and z + dz is y.dz.

If we take the experiment I instanced at the beginning, the tossing of a coin, and make the number of times tossed very great, the chance of obtaining given deviations would be given by the curve of error, as already shown. This is the frequency curve for the group of experiments. Events are ruled by very different laws of distribution. We may have a very skew curve, as, for instance, in the curves of ages of wives in Yorkshire where the mode was a long way to the left of the average; the smooth curve which best fits those observations would be the curve of frequency for the ages of such persons. That is to say, if we draw this curve, representing as nearly as possible the observed facts, and we make this area equal 1, the area standing on the part of the axis between the 35 and 40-year marks would represent the chance of a person taken at random being between 35 and 40 years old. If we were given the age of a man who had a wife in Yorkshire and we did not know her age, that area would represent the chance that her age would be between 35 and 40. The life curve, to take another example, is a frequency-curve. To any frequency-curve we can assign a modulus calculated from the second moment. That tells one distinct fact as to the distribution about the average. The curve may have the greater part of its area to the left or to the right of the average, and it may have an asymptote as in the case of the curve of error; but there is in general only a small fraction of the area beyond two or three times the modulus, which may therefore be taken as indicating the practical extent of the curve. It is often useful to speak of the precision (h), instead of the modulus (c), where $h = \dfrac{1}{c}$. The greater h is, the more precise are the predictions that can be made as to a magnitude taken at random.

If we are dealing with frequency-curves whose practical range is small and whose modulus is finite, and if we take a great number of these frequency-curves, or rather if we have to select from a great number of things whose sizes are ruled by different frequency-curves, for example, if we make up a line of a great number of pieces of metal taken from different heaps with different frequency-curves for each heap, it is possible to find the frequency-curve for the sum of these elements, that is for the length of the line you have made. I will put that in different form with a different illustration. Suppose we are going to take 100 books, and we can select them from 100 different groups of books whose thicknesses are bounded within definite ranges and have a different modulus which can be assigned, required the breadth of 100 books put together. The most probable breadth will be that obtained by adding the averages of the 100 different groups. From the terms of the question it is obviously very improbable we shall get all the 100 below the averages of their respective groups or all above. The actual breadth will have a frequency-curve of its own about an average which is the sum of the averages of the groups from which you select. Its modulus can be shown to be the square root of the sum of the squares of the moduli of the original frequency-curves. Thus, to take a special case, if we are going to select two only, which obey normal curves with the same modulus, the modulus for the sum is $\sqrt{2}$ times the modulus of either. The developments from this theory are of great practical importance.
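The rule for the sum may be written out in a few lines. The following Python sketch is an illustration only; it assumes, as the argument does, that one quantity is taken from each group independently of the others, and the function names and figures are hypothetical.

```python
import math

def average_of_sum(averages):
    """Most probable value of the sum: the sum of the separate averages."""
    return sum(averages)

def modulus_of_sum(moduli):
    """Modulus of the sum: square root of the sum of the squares of the moduli."""
    return math.sqrt(sum(c * c for c in moduli))

if __name__ == "__main__":
    # Two quantities with the same modulus c: the sum has modulus c * sqrt(2).
    print(modulus_of_sum([1.5, 1.5]), 1.5 * math.sqrt(2))
    # One hundred books, each group with an assumed modulus of 0.2 inch.
    print(modulus_of_sum([0.2] * 100))
```

In the second instance the modulus of the whole breadth is ten times that of a single book, not a hundred times, which is why it is very improbable that all the books should fall above or all below their averages.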

If we take one sample at random from each of a number of these frequency-curves whose moduli are not very unequal, so that no one curve predominates, and add together the quantities so obtained, then the quantity obtained obeys the curve of error itself, whether the original frequency-curves were curves of error or not. I cannot give the proof here; the theorem as I state it is partly due to Laplace and partly due to Professor Edgeworth.* That is one of the most general statements of the cases in which the curve of error will arise; and that conception may properly be applied to the conception of height and the causes which determine the person's height. No single cause has very great influence compared with others, so far as we know, and they all presumably have measurable effects whose frequency-curves are definite. Thus, we might expect a priori the frequency-curve of heights to be the curve of error.

Another illustration is supplied by the grouping of school children in a particular grade.† I took one of the most populous grades in the Report of the St. Louis Public Schools, U.S.A., grouped the children according to their ages, and fitted the curve of error by one of the methods I have described. The curve of error with c = 1.68, j = .073, fits the observations closely. If we think of the causes which determine the position of a child in a particular grade or class, I think we shall find that they are akin to those I have supposed in my statement as to causes which lead to the asymmetrical curve of error. But it would be absurd to go back and try to re-value p, q and the quantities on which the algebraic proof of the equation depended. We could find out, of course, what chances would produce this particular distribution; but they would have no necessary relation to the facts. The idea I wish to give is that we can obtain the equation of the curve of error in the form I am using it on a very simple supposition; and it can be obtained from many other suppositions which cannot be given in lecture work.

* See Edgeworth, in London, Edin. and Dublin Phil. Mag., 1892, p. 429.

† For the numbers and diagram, see Elements of Statistics, 2nd Edition, Appendix.



MEASUREMENT OF GROUPS. 


FOURTH LECTURE. 


The Method of Least Squares.

Suppose that we make a great many measurements of the same quantity by several different methods; and that, as is generally the case, the measurements differ from each other, owing to imperfections of instruments, or by the numerous accidental circumstances that attend any involved observations. Let us assume that the measurements which could be made by the first method are grouped according to the frequency-curve $y = \dfrac{1}{c_1\sqrt{\pi}}\,e^{-\frac{x^2}{c_1^2}}$, those by the second method according to $y = \dfrac{1}{c_2\sqrt{\pi}}\,e^{-\frac{x^2}{c_2^2}}$, and so on, a definite normal curve for each method. Suppose we make n measurements, one of each kind. It is required to find what is the most probable value of the distance to be measured. All the x's we are dealing with are errors in our measurement. From the series of partly erroneous measurements it is required to find the most probable value. That is the problem it is attempted to solve by the method of least squares. As a second question, it is required to determine the precision of the result, that is, to state the probability that it is correct within assigned limits.

Before going further I must call attention to one very important point in the reasoning. In the reasoning on which the method of least squares is based it is assumed that the frequency-curves are normal curves of error, as written above.



If the frequency-curve is not a normal curve of error the method breaks down at the first step. That I shall have to return to later. With regard to the moduli, we may either suppose that we know them by some a priori method, as is sometimes the case; or that we know them by having made similar experiments at some other time, e.g., if we are dealing with a group of height measurements where the modulus is three inches generally; or we may find them from the experiments themselves. A useful way is to repeat the measurement by each method, say, 100 times, and from the internal evidence find out what the moduli are. We assume that the moduli are fixed quantities, quantities which we cannot affect, and that they are known or previously determined quantities.

What is the probability that a certain series of errors should result in n observations? Let $x_1, x_2, \ldots, x_n$ be the differences from the unknown true value which arise from n different methods taken in one series; what is the probability that those particular n deviations will occur at once? The probability is obtained by multiplying together the probabilities of their separate occurrences. The probability of the error $x_1$ occurring, when the modulus is $c_1$, is from its curve of frequency $\dfrac{1}{c_1\sqrt{\pi}}\,e^{-\frac{x_1^2}{c_1^2}}$. The probability that the n will all occur is obtained by multiplying n such quantities together, that is, $\dfrac{1}{\pi^{\frac{n}{2}}\,c_1c_2\ldots c_n}\,e^{-\Sigma\frac{x^2}{c^2}}$. Here the only variables are the x's. Now, that probability will be greatest when the index of e is greatest, that is when $\Sigma\dfrac{x^2}{c^2}$ is least. Thus, from all the possible values of the unknown true measurement, the system of errors which we have found would arise with the least improbability when $\Sigma\dfrac{x^2}{c^2}$ is made the least possible. That is the statement which is at the basis of the method of least squares. In the particular case, when we take all the observations by the same method with the same curve of frequency, so that c is the same for all the observations, the minimal condition is satisfied when the sum of the $x^2$ is a minimum; and we have already seen that that sum is made least when the unknown value is taken to be the arithmetic average of the obtained values. Let me re-state this theorem in other words. Suppose we start to measure a particular object by the same method again and again. Then, the measurements we obtain would come with the least improbability when the sum of the squares of the deviations is a minimum; and that condition is satisfied if we take the arithmetic average of our measurements to be the unknown true quantity. This statement is a particular case of the method of least squares.
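When the moduli differ, the condition that $\Sigma\,x^2/c^2$ shall be least leads to an average in which each measurement is weighted by $1/c^2$; this follows on differentiating the sum with regard to the unknown value. The Python sketch below illustrates that step under the assumptions of the text (normal curves of error with known moduli); the function names and the figures are hypothetical.

```python
def most_probable_value(measurements, moduli):
    """Value of the unknown quantity which makes the sum of x^2 / c^2 least,
    x being the difference of each measurement from that value."""
    weights = [1.0 / (c * c) for c in moduli]
    weighted_total = sum(w * m for w, m in zip(weights, measurements))
    return weighted_total / sum(weights)

def weighted_sum_of_squares(value, measurements, moduli):
    """The quantity that the method of least squares makes a minimum."""
    return sum(((m - value) / c) ** 2 for m, c in zip(measurements, moduli))

if __name__ == "__main__":
    measurements = [10.0, 12.0]
    # Same modulus for both methods: the arithmetic average.
    print(most_probable_value(measurements, [1.0, 1.0]))
    # Second method twice as rough: the result leans to the first measurement.
    print(most_probable_value(measurements, [1.0, 2.0]))
```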

When we have grasped that initial principle, the rest of the investigation is only a matter of the differential calculus; there is nothing special about it. We have to write down all the equations that connect the quantities we are measuring, and then by the ordinary processes of the differential calculus express the conditions that the sum of the squares of the errors shall be a minimum, and these will give enough equations to solve for all our unknowns. I will illustrate that algebraically by a particular case. Take the case with which we have already dealt, namely, that in which we had the ages of the wives in Yorkshire. There we obtained a somewhat irregular curve representing the numbers at different ages, and we smoothed that curve by putting parabolic curves of the fourth degree through various points; and it will be remembered that we had to change the constants in our equation according to the particular group of five points selected. Now let us assume that we have a parabolic equation of the third degree in this form,

$$y = a_0 + a_1x + a_2x^2 + a_3x^3$$

This equation has four unknowns; we can therefore make it pass through any four assigned points, but we cannot make it pass through five assigned points. Suppose that we wish to determine an equation of the third degree which will pass near the five points, then we will apply the method of least squares to that problem. Let the co-ordinates of the actual observations be $(x_1, m_1)$, $(x_2, m_2)$, and so on. Let the corresponding points which we are to find on this particular curve be $(x_1, y_1)$, $(x_2, y_2)$, and so on. The point $(x_1, y_1)$ will be near, but probably not coincident with, the point $(x_1, m_1)$. The difference between $m_1$, the observation, and $y_1$, which would be given by the curve which we have not yet determined, is the error of the observation. We are to determine the constants so that the sum of the squares of those errors shall be least. Writing that a little more fully, and substituting for y in terms of x, we have that

$$\Sigma\left(m_1 - a_0 - a_1x_1 - a_2x_1^2 - a_3x_1^3\right)^2$$

is to be a minimum.

In that expression the variables are the four a's, which have to be determined so as to make the expression a minimum. Therefore we must differentiate that expression, when it is written out, with respect to $a_0$, $a_1$, $a_2$ and $a_3$, and equate these partial differential coefficients to zero, obtaining as many equations as we have unknowns. Then we have to solve the equations so obtained.

After a little simplification the following equations are obtained:

$$5\,a_0 + \Sigma x_1\cdot a_1 + \Sigma x_1^2\cdot a_2 + \Sigma x_1^3\cdot a_3 - \Sigma m = 0$$
$$\Sigma x_1\cdot a_0 + \Sigma x_1^2\cdot a_1 + \Sigma x_1^3\cdot a_2 + \Sigma x_1^4\cdot a_3 - \Sigma mx = 0$$
$$\Sigma x_1^2\cdot a_0 + \Sigma x_1^3\cdot a_1 + \Sigma x_1^4\cdot a_2 + \Sigma x_1^5\cdot a_3 - \Sigma mx^2 = 0$$
$$\Sigma x_1^3\cdot a_0 + \Sigma x_1^4\cdot a_1 + \Sigma x_1^5\cdot a_2 + \Sigma x_1^6\cdot a_3 - \Sigma mx^3 = 0.$$

The chief thing I want to say about these equations is, that they are so complicated, and a solution is so laborious, that they must be put out of court for all ordinary calculations. If you wish to construct a new table which will be of some general use, it may be worth while to go through the solution, but not for any single practical piece of work. Every one of those separate terms, $\Sigma x_1$, &c., has to be calculated arithmetically, and the equations have to be solved. Even in this simple case we have four equations each containing four functions. In Merriman's Method of Least Squares, the simplest methods for that evaluation are given. Many terms drop out, and the evaluation is possible; and in some cases we can so choose our origin and take advantage of certain points of symmetry in the equations, that the work can be simplified. In this particular case a simple solution has been given by Professor Darwin.*
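The labour spoken of is that of forming the sums and solving the four equations by hand; with a machine it is a matter of a few lines. The Python sketch below is an illustration only and is not Professor Darwin's solution: it forms the normal equations for the equation of the third degree and solves them by ordinary elimination. The function names are hypothetical, and the five points are invented so that they lie exactly on a known curve.

```python
def normal_equations(xs, ms, degree=3):
    """Coefficients and right-hand sides of the normal equations for
    y = a0 + a1*x + ... + a_degree * x**degree fitted to the points (x, m)."""
    size = degree + 1
    matrix = [[sum(x ** (row + col) for x in xs) for col in range(size)]
              for row in range(size)]
    rhs = [sum(m * x ** row for x, m in zip(xs, ms)) for row in range(size)]
    return matrix, rhs

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting, then back-substitution."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (a[r][n] - tail) / a[r][r]
    return solution

def fit_cubic(xs, ms):
    """Constants a0, a1, a2, a3 making the sum of squared errors least."""
    return solve(*normal_equations(xs, ms))

if __name__ == "__main__":
    xs = [-2, -1, 0, 1, 2]          # origin chosen at the middle point
    ms = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]
    print(fit_cubic(xs, ms))
```

With the origin taken at the middle of five equidistant points the sums of the odd powers of x vanish, which is one of the points of symmetry by which the work can be simplified.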

FiTTIHG FOEMULiE TO ObSEEVATIONS. 

Before we look for another way, let us consider again 
whether the assumptions on which the above method depend 
are justifiable, or will justify the great effort which would be 

* See Darwin, “On EalliMe Measures,” Zondon, JSdin. and BuUin Fhil. 
Maff., July 1877; used in Elements of Statistics, pp. 256. 257. 



necessary to solve the equations. I think it will be found 
that in general they do not. If we look back through the 
argument, it will be seen that the original assumption is that 
the difference between the actual number of persons 
observed, and the number obtained from the equation, 
belongs to the normal curve of frequency; and in every 
case where the method of least squares applies we have an 
observed measurement, and we obtain a theoretical measurement, 
and we assume that the difference between the two 
belongs to a normal curve of frequency. Before we can 
make that assumption we must verify that the conditions, under 
which the normal curve of frequency is obtained, are satisfied. 
We are not in a position to do that, if we depend only on the 
algebraic proof given above, without investigating the 
deductions of the equation of the curve of error resting on 
other hypotheses. But to my mind there is no proof yet 
given which does show that the normal curve of error will be 
obeyed in the circumstances I have just mentioned; and 
Professor Karl Pearson has shown that in very many instances 
the normal curve is not obeyed. So the theory is at any rate 
difficult to establish à priori, and is not supported by universal 
experience. I think, with all the deference that is due to 
Professor Karl Pearson, that the matter yet wants more 
practical experience before it can be fully decided. It would 
be unsafe in the present state of the argument on the one 
hand to say that the normal curve of frequency may be 
expected; or on the other hand to say definitely that it is 
not to be expected, because it has not been universally 
found. That is too difficult to deal with at all thoroughly 
here. The reason I have gone so far into it is this: if the 
method of least squares is very difficult to apply, and if it is 
neither supported sufficiently by theory nor by experiment, 
then it seems expedient to try some other method. A purely 
empirical method would be this: Instead of making the sum 
of the squares of the deviations a minimum, make the sum of 
the first powers of the deviations, all reckoned as positive, a 
minimum, that is to say, remove the square outside the 
bracket in the expression on p. 48. But it is not at all easy 
to make that sum a minimum, because all the terms have to 
be taken as positive, and we do not know until we have finished 
our work which terms are naturally positive or which terms 
are negative. Professor Edgeworth has given a method of 





the solution when there are only two unknowns.* 
When there are three unknowns I believe there is as yet no 
practical solution. 
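
As a minimal sketch of the empirical criterion just described, the following Python fragment fits a straight line $y = ax + b$ (two unknowns) by making the sum of the absolute deviations a minimum. It is not Professor Edgeworth's procedure; it relies only on the known property that, for a straight line, some minimising line passes through at least two of the observed points, and so simply tries every pair. The function names and the observations are hypothetical.

```python
from itertools import combinations

def sum_abs_dev(points, a, b):
    """Sum of the first powers of the deviations, all reckoned as positive."""
    return sum(abs(y - (a * x + b)) for x, y in points)

def fit_line_least_abs(points):
    """Try the line through every pair of points and keep the best one."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        total = sum_abs_dev(points, a, b)
        if best is None or total < best[0]:
            best = (total, a, b)
    return best  # (minimum sum of absolute deviations, a, b)

# Hypothetical observations, invented for illustration; the last is an outlier.
points = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.2), (4, 20.0)]
total, a, b = fit_line_least_abs(points)
print(total, a, b)
```

With only two unknowns the search over pairs stays small; the difficulty noted in the text grows quickly as unknowns are added.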

Another method, still taking the method of least squares 
as the basis, but avoiding the very complex solution, is to 
choose the coefficients, so that the curve will pass through 
exactly the four points assigned; and then re-calculate them, 
so that the curve shall exactly pass through four other 
assigned points ; and so continually calculate again and again 
the coefficients, getting a series of curves. Then from the 
various values of the coefficients so found, choose those 
coefficients which appear to give the best results. It is really 
a makeshift method. I think it has been often employed, 
and the results have been very satisfactory. If, by one method 
or another, you get coefficients which make the theoretical 
curve pass near the original curve, it does not matter by 
what process you have got them. Such a method as that, I 
think, is in general use for approximating to the population in 
inter-censal years. I think the Census Office has never 
published this method; but as far as I can find out, the 
method employed is as follows: Supposing certain points 
represent the population at the various dates at which it is 
exactly enumerated, then if, as a first hypothesis, we assume that 
the population increases in geometric progression between two 
enumerations, we obtain a simple curve passing from one point 
to the next. Then assume again that from this Census to the 
next there is another increase in geometric progression, and we 
find that the two curves never have exactly the same constants. 
Then obtain some method for passing from one curve to the 
other without a sudden break of curvature, reject the parts of 
the curves near the Census years, and replace them by a curve 
which gradually passes from one to the other. That is a purely 
empirical method, and I think it is the one adopted. It is in 
some such way as this that we can go to work if the method of 
least squares is too complicated. 
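
The first hypothesis of this makeshift method, growth in geometric progression between two enumerations, can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the census figures are invented, and the later step of smoothing the junction between successive curves is not attempted.

```python
def intercensal_estimate(pop_start, pop_end, years_between, years_elapsed):
    """Population at an intermediate date, assuming a constant yearly ratio."""
    ratio = (pop_end / pop_start) ** (1.0 / years_between)  # yearly growth ratio
    return pop_start * ratio ** years_elapsed

# Hypothetical census counts ten years apart (not real figures).
p0, p1 = 1_000_000, 1_210_000
mid = intercensal_estimate(p0, p1, 10, 5)
print(round(mid))
```

Because each ten-year interval yields its own ratio, two adjoining curves meet at the census date with different constants, which is the break the text proposes to smooth away.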

The third method, to which I wish to call attention very 
particularly, proceeds in quite a different way. We tabulate 
our observations as before, and write down the equation of a 
curve which is assumed to fit them, with unknown constants ; 
calculate from the observations the moments—first, second, 

* See Edgeworth, "On a New Method of Reducing Observations," Phil. 
Mag., 1888; in Journal of Royal Statistical Society, June 1902, p. 341. 





third, fourth (as many as there are unknowns) about the 
centre of gravity, by the method used above, and calculate 
the moments from the assumed curve in terms of the 
unknowns. Equating the moments found from the observations 
with the moments found for the assumed curve, we have the 
equations determining the constants. For example we may 
take the instance already discussed, when we found a skew 
curve of error to fit certain observations. The general 
equation to the skew curve of error being given, by the help 
of the integral calculus we stated the values of the first, 
second, and third moments in terms of c and j; we equated 
these to the moments calculated from the observations, and 
thus found c and j. We need to calculate as many moments 
as there are unknowns in the particular equation selected. 
For instance, in Makeham's formula there are four unknowns, 
and we have to take four moments. In the normal curve of 
error there are two unknowns, its centre and the modulus; 
two moments are therefore sufficient to find the normal curve 
of error by this test. In the skew curve of error, the quantity 
j has to be determined in addition. In the empirical equations 
given by Professor Karl Pearson in his well-known paper on 
the measurement of skew groups, which was published in 1895 
in the Proceedings of the Royal Society, there are four 
unknowns, and therefore in general he needed four moments. 
In the parabolic interpolations, such as I have used in these 
lectures, there are as many unknowns as we like to take. If 
we stop at $x^3$ we need four moments. In Professor Pareto's 
empirical equation for the grouping of the incomes of the 
people of a country there are two unknowns. The 
equation is as follows: $y = \dfrac{A}{x^a}$, where y is the number 
of persons in receipt of income x and A, a are constant. 
It is also given in a developed form with one more 
constant. It is supposed that the index a is nearly the same 
for all countries, while A varies from country to country. 
You could obtain those values by the principle of least 
squares, or by equating moments. This is not the place to 
criticise the equation: I only give it as an example of an 
algebraic equation for statistical grouping. We see then how 
to obtain sufficient equations for the unknown constants, 
and so we come naturally to the question of what is the 
justification for this method. I think I must refer you, in 




general, to Professor Karl Pearson's paper for the justifications, 
because it is his method, and in particular he has quite recently 
published a paper in the journal Biometrika* going very 
carefully into this whole method; and all I can do is to simply 
follow in his steps. The method depends on a purely empirical 
basis, not on any à priori theory. By its means we do, as a 
matter of fact, obtain an equation which fits the observations. 
But, incidentally, Professor Karl Pearson shows that the results 
obtained are, in general, the same as those obtained by the 
method of least squares. Without basing his system upon 
the coincidence at all, he does obtain the same results. The 
advantage of the method is, as he has also shown in the 
same paper, that the solution of the equations obtained is 
very much easier than the solution of equations obtained by 
the ordinary method of least squares. I hesitate to go further 
into this subject because it is Professor Karl Pearson's subject, 
and all his papers are very easily accessible. He has shown 
that empirical algebraic formulae can be found for a very 
wide range of groups, and in every case he has fitted equations 
to the groups by the help of this number of moments. He has 
then found that the equations so obtained do fit the groups 
exceedingly well. Groups may, perhaps, contain 30, or 40, 
or 100 measurements, but the constants at disposal are only 
4. If you calculate these 4 constants by any method and 
obtain, as a result, the equations which fit a wide range of 
observations, you have a strong empirical justification for the 
method. I believe that is the justification which Professor 
Karl Pearson gives for the method. But we are met face to 
face with this difficult question, which it is impossible to deal 
with here and now : How far ought we in such investigations 
to take empirical formulæ which are only justified by their 
results, and how far should we base our reasoning on à priori 
assumptions as to the nature of error, and as to its occurrence, 
assumptions which underlie the theory of probability, and 
from such assumptions obtain our equations? Should we 
obtain our equations with the view to fitting the result, or 
should we obtain our equations from à priori reasoning and 
see how far they fit the results? To my mind we have not 
nearly enough experience in the matter at present. We have 
not sufficiently tested the fitting of groups to the à priori 

* Biometrika, April and December 1902. 



equations, nor have we yet sufficient experience to say that 
the empirical method is universally satisfactory because it has 
been found to fit wide ranges of groups. At that point I 
must leave the question. 

Uses of the Curve of Error. 

Whatever may be the ultimate decision in the questions 
which I have thus stated, there are certainly many uses for 
the curve of error in the form in which I gave it in the last 
lecture, quite independently of the discussion we have just 
been engaged in. In what I have been recently saying I have 
been following, as far as possible, Professor Karl Pearson's 
method. In what I shall say now I am following Professor 
Edgeworth's work. I do not mean that the two are 
contradictory in any way; I wish to indicate that I am 
trying to summarize the present position of this question on 
the lines of the two most eminent authorities in this particular 
work. For clearness, I repeat the method of generating the 
curve of error given on p. 43. Suppose we have a number of 
frequency-curves, each of small and limited range, that is to 
say, of great precision, its modulus being small; let the 
moduli of n such curves calculated from the squares of the 
deviations be $c_1, c_2, \ldots, c_n$. The curves may be of any shape, 
except that no finite part of their areas may be at a great 
distance from their centres of gravity. Suppose we take $m_1$ 
observations belonging to the first curve, $m_2$ out of the 
second, and so on, and add them together; the curve of 
frequency for the resulting sum is the normal curve of error 
with modulus $\sqrt{m_1 c_1^2 + m_2 c_2^2 + \dots + m_n c_n^2}$. If instead of taking the sum, we 
take any other function to which the sum is the first 
approximation, the curve of frequency for the values of 
this function is likely to approximate to a normal curve of 
error; but we will here limit ourselves to the sum. The 
following diagram and the experiment on which it depends 
illustrate this theory. I took Chambers' mathematical tables, 
and chose three digits at random and took their average, and 
repeated this a thousand times. The curve of frequency of 
the 10 natural digits is a straight line; you are as likely to 
get any one of them as any other, if you select a suitable part 
of the tables. I have represented that curve of frequency 
by ten dots. It is limited at both ends, its modulus is fairly 
large, viz.: 4.06, and it supplies a very severe test of the 
principle I have enunciated, because we have a curve of 
frequency which is absolutely different from the normal 
curve of error; it does not approximate to it in any way 
whatever. 

[Diagram IX. Frequency line for the digits 0 to 9.] 

The actual probabilities of the occurrence of 
various numbers are the successive coefficients in the 
expansion of $(1 + x + x^2 + \dots + x^9)^3 \div 1000$. Comparing these 
with the result of the experiment we have the following 
table: 


Average of            No. of times this average 
3 digits taken        Was actually      Might be expected 
at random             found             every 1,000 times 

0                        0                   1 
⅓                        4                   3 
⅔                       11                   6 
1                       10                  10 
1⅓                      14                  15 
1⅔                      17                  21 
2                       26                  28 
2⅓                      33                  36 
2⅔                      53                  45 
3                       48                  55 
3⅓                      60                  63 
3⅔                      82                  69 
4                       76                  73 
4⅓                      72                  75 
4⅔                      73                  75 
5                       75                  73 
5⅓                      61                  69 
5⅔                      65                  63 
6                       60                  55 
6⅓                      35                  45 
6⅔                      35                  36 
7                       29                  28 
7⅓                      30                  21 
7⅔                      15                  15 
8                        3                  10 
8⅓                       6                   6 
8⅔                       7                   3 
9                        0                   1 


It is not my point here to show that those figures are what 
you would expect to get; what I wish to show is, first, that 
the successive probabilities, when they are plotted out, 
resemble the curve of error; and, secondly, that the experiment 
tends to fit a normal curve of error. In Diagram IX the 
continuous line with dots on it is the frequency which you 
would expect. The broken line is the curve of error, with the 
same area and modulus, and the crosses are the positions 
obtained from the actual experiment. It is seen that 
although we started with a frequency-curve which was a straight 
line, the theoretical curve which we obtained for the 
average of only three terms selected from it is already so much 
like a curve of error that you would mistake it for one, if a 
model was not traced on the paper; and that the actual 
experiment supports the same view. 

We note that the modulus calculated from the squared 
deviations for the natural digits is 4.06, and that from the 
formula $\sqrt{\sum m c^2}$ given above the modulus for the sum of 
three digits should be $\sqrt{3} \times 4.06 = 7.032$, and for the average 
of three digits should therefore be 2.344. The modulus of 
the curve given by the calculated probabilities of the various 
numbers is 2.345, while that calculated from the results of the 
experiment is 2.358. The averages are 4.5 (theoretical) and 
4.494 (experimental).* 
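
The figures quoted for this experiment can be checked with a short Python sketch. The expected frequencies are obtained by multiplying out $(1 + x + \dots + x^9)^3$, and the observed counts are transcribed from the table of the experiment; the function names are hypothetical, and the modulus is computed as the square root of twice the mean squared deviation.

```python
import math

def expected_counts(n_digits=3):
    """Coefficients of (1 + x + ... + x^9)^n_digits, by repeated convolution."""
    coeffs = [1]
    for _ in range(n_digits):
        new = [0] * (len(coeffs) + 9)
        for i, c in enumerate(coeffs):
            for d in range(10):
                new[i + d] += c
        coeffs = new
    return coeffs  # index k = sum of the digits drawn

def mean_and_modulus(counts, divisor):
    """Average and modulus of the sums, both divided to refer to the average."""
    n = sum(counts)
    mean = sum(k * f for k, f in enumerate(counts)) / n
    var = sum(f * (k - mean) ** 2 for k, f in enumerate(counts)) / n
    return mean / divisor, math.sqrt(2 * var) / divisor

expected = expected_counts(3)
# Counts "actually found", in order of the averages 0, 1/3, 2/3, ..., 9.
observed = [0, 4, 11, 10, 14, 17, 26, 33, 53, 48, 60, 82, 76, 72,
            73, 75, 61, 65, 60, 35, 35, 29, 30, 15, 3, 6, 7, 0]

theo_mean, theo_mod = mean_and_modulus(expected, 3)
obs_mean, obs_mod = mean_and_modulus(observed, 3)
print(theo_mean, round(theo_mod, 3), round(obs_mean, 3), round(obs_mod, 3))
```

The expected column of the table is exactly the list returned by `expected_counts(3)`, whose entries add up to 1,000.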

Construction of a Group from Samples. 

The theory which I have just enunciated, for the proof of 
which see the reference given on page 44, is, that if we start 
with any frequency-curves, and take our examples from them, 
one from each or many from one, and take the average, we 
shall obtain a curve which becomes more and more like the 
curve of error as we extend the number of our examples, 
and as the frequency-curves satisfy more and more nearly the 
limited conditions which are laid down for them. Now, that 
is not only a mathematical theory: it has very great practical 
importance. Supposing that we take a number of samples 
out of a large group, how near the true average may we 
expect to get ? If the curve of frequency of the group was 
a curve of error, we can at once write down the probability of 
different divergencies. If we have a curve of error with 
modulus c, and we select n samples at random from it, and 
then take their average, the modulus for their sum is $c\sqrt{n}$ from the 
formula already given, and hence that for their average 
is $c/\sqrt{n}$. The modulus of the arithmetic average varies inversely 

as the square root of the number of items, a very well-known 
principle. I wish to show how this theory can be adapted to 

* See also Edgeworth, in Jubilee Volume of the Journal of the Royal 
Statistical Society, p. 186. 





curves of frequency other than the normal curve of error. 
Suppose the original curve of frequency be any curve 
whatever, a curve of survivors for example, I do not assign 
any particular shape to it. Suppose we go through an 
experiment, taking, we will say, m examples at random from 
it, and repeat the process k times. [In the experiment 
just discussed m was only three, and k was 1000.] Though 
the original numbers do not obey the normal curve of error, 
yet the average of m of them may be expected to, when m is 
sufficiently great. Let c be the modulus for the group of 
averages of m samples; then $c/\sqrt{k}$ may be expected to be the 
modulus for the average of the whole mass of km samples. 
Thus, in the above experiment, c was 2.35, k 1000, and 
$c/\sqrt{k}$ = .064; the known average for all digits, which formed 
the original curve of frequency is 4.5, the average for the 
3,000 selected, in 1,000 groups of three, was 4.494; the 
difference is one-tenth of the modulus just calculated; so small 
a difference might be expected once in nine trials. 

Thus, whether the curve of frequency of the original group 
is the normal curve of error or not, the precision of the 
average of a great number of samples is proportional to the 
square root of that number. 
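
The arithmetic of this paragraph can be retraced in a few lines of Python. The sketch assumes only the figures quoted (c = 2.35, k = 1000, averages 4.5 and 4.494) and the standard result that, for a normal curve written with its modulus, the chance of a deviation within t times the modulus is erf(t); the variable and function names are hypothetical. Evaluated directly, c/√k comes to roughly 0.074, a little above the figure .064 given in the text, so the ratio is somewhat under one-tenth.

```python
import math

c = 2.35   # modulus of the averages of three digits, from the experiment
k = 1000   # number of such averages taken
modulus_of_grand_average = c / math.sqrt(k)

difference = 4.5 - 4.494   # known average less the average actually found
ratio = difference / modulus_of_grand_average

def chance_within(t):
    """Chance that a normal deviation lies within t times the modulus."""
    return math.erf(t)

print(round(modulus_of_grand_average, 4), round(ratio, 3), round(chance_within(0.1), 4))
```

A deviation within one-tenth of the modulus has a chance of about 0.11, which agrees with the statement that so small a difference might be expected once in nine trials.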

Now let us see how to construct not merely an average, 
but a whole group, by the method of samples. 


Gazette Prices of Wheat per quarter. 

                   (1)        (2)          (3)        (4)          (5) 
Price            No. of    Frequency       100     Frequency        25 
                  Cases     per 100      Samples     per 25      Samples 

Under 20/-           3          0            0          0            0 
20/- to 30/-       111         18           19   }     10            8 
30/- to 40/-       134         21           16   } 
40/- to 50/-       206         32           30   }     12           13 
50/- to 60/-       105         17           18   } 
60/- to 70/-        47          7           11   }      3            3 
70/- to 80/-        26          4            4   } 
Above 80/-           4          1            2          0            1 

Total              636        100          100                      25 
Average           43/9                    45/4                    46/6 







The table and diagram give the result of an experiment in 
such construction. The material of the experiment is of no 
importance here; I merely took the most accessible figures to 
conduct the experiment, namely, the official Gazette prices of 
wheat for the 636 months for which they are recorded in the 
statistical abstracts, and regarded that as a group 
which I was going to build up by sample. For complete 
illustration I had to take a group I knew, and then to take 
samples of it. In general, of course, the group is not known, 
but has to be constructed from the samples. The actual 
group is that given in Diagram X in the continuous lines. To 
obtain the samples, I took Chambers' mathematical tables, 
and assigned to particular numbers, from 001 to 636, 
certain months, and took 100 numbers of three digits at 
random. Next, I wrote down the prices in the 100 months 
corresponding to those 100 numbers, and grouping them in 
10s. groups, obtained the numbers given in the third column 
above, and also given by the crosses in the Diagram X (a). 
I next selected 25 samples by taking the first 25 of the 100, 
and I grouped the figures in 20s. groups, and obtained the 
numbers given in the fifth column and by the crosses in 
Diagram X (b). What rule have we for deciding how near 
the true group the sample is? In the third division, for 
instance, between 30s. and 40s. in the whole group, there are 
134 instances, and 21 per cent. of the area is between 30s. 
and 40s. If we take 100 things at random out of the whole 
group, how many of that 21 per cent. are we likely to get? 
This is a simple problem in probability: if n samples are 
taken, the chances that 0, 1, 2 . . . n will come from a given 
part, which is to the whole as is p to 1, are the successive 
coefficients of the expansion of $(q + p)^n$, where $q = 1 - p$; as 
n increases we approximate to a curve of frequency with 
modulus $\sqrt{2pqn}$ (see p. 34). In the third division p = .21, while 
n, the whole number of samples in the first experiment, is 100. 
Here $\sqrt{2pqn} = \sqrt{2 \times .21 \times .79 \times 100} = 5.8$. The difference 
between the actual number per 100 in the group, namely, 21, 
and the number found in the sample, namely, 16, is less than 
the modulus. In all the other cases in both experiments the 
differences are within the "probable error" (which is .47 of 
the modulus, see p. 36). We have thus found a criterion of 
the divergencies to be expected between the distribution 
in a group of samples and the distribution in the 
unknown group from which they arise. 

As regards the precision of the averages of the samples, the 
modulus of the original group is about 19s., and, therefore, the 
moduli for the averages of 100 and of 25 samples, respectively, 
are 19s. ÷ $\sqrt{100}$ = 1s. 11d., and 19s. ÷ $\sqrt{25}$ = 3s. 10d. The 
averages found from the samples are actually 45s. 4d. and 
46s. 6d. which are, respectively, 1s. 7d. and 2s. 9d. in excess 
of the average of the whole group. 

The experiment, therefore, forms a good illustration of the 
theory, and on consideration it will, I think, be found that 
the theory is in strict accordance with common-sense and 
common experience. 


MEASUREMENT OF GROUPS. 


FIFTH LECTURE. 


Correlation between Two Groups. 


Let there be n pairs of measurements $(x_1, y_1)$, $(x_2, y_2)$, and so 
on up to $(x_n, y_n)$, the members of each pair having some 
determinate connection with each other; for example, suppose 
that the x's are the ages of the wives in the group taken 
above, and the y's the ages of their husbands, $x_r$, $y_r$ being 
the ages of a married couple. This is the example discussed 
below. Or suppose that $x_r$ is the age at which a man dies, 
and $y_r$ the age at which his father died; or suppose that $x_r$, $y_r$ 
are measurements of physical characteristics of the same man. 
Or again, $x_r$ might be a death rate, in a year in which $y_r$ was 
the average temperature. It is required to measure the 
relationship between x's and y's so as to answer this question: 
Given one of the x's, assign the probable value of the 
corresponding y. For example, given the age at which a man 
died, assign the most probable age to which his son will live. 
Or, taking one member of the group of wives at random, state 
the probabilities of the age of her husband. We have in fact 
to give numerical expression to such statements as these: 
A high death rate goes with a low temperature; a long-lived 
father has long-lived sons; for all statements where 
two measurable quantities are connected in that way, where 
in common parlance we connect them with simple adjectives, 
we have to find a numerical or mathematical expression for 
the relationship. First suppose that there is no causal 



connection between the two groups. Then if we select any 
particular place on the axis on which the x's are measured, 
and mark in the corresponding y's, we shall get a group of y's 
whose average is equally likely to be above or below the 
average of all the y's. Suppose we choose a group of wives, 
between the ages 25 and 30, mark in the ages of their 
husbands, and mark the average of such ages, if there is no 
connection between the ages of the one group and the ages of 
the other, the average of the group so taken will be near or 
equal to the average of the whole group of husbands, namely, 
42 years. And so, if we take another period and mark in the 
various ages of the husbands we should again find the average 
near the average of the whole group. If the x's are 
represented on a horizontal axis, and the y's are measured 
vertically by points placed above the values of x which are 
their pairs, then if there is no causal connection between the 
magnitude of the x's and of the y's, the averages of groups of 
the y's corresponding to assigned intervals on the axis of x 
will all lie near the horizontal line through the averages of 
the y's. They will not lie on it, but the best straight line we 
can draw near these points will be a horizontal line through 
the average; that is obvious as soon as the statement is 
understood. 

But now suppose there is a causal connection between the 
two sets of measurements; suppose, for example, that a high 
value of x goes with a high value of y. Then if we start from 
the average value of x, which we may assume for the moment 
corresponds to the average value of y, and pass to the right 
and choose a group at a place above the average for the x's, 
the y's which are obtained for that group will be distributed 
about an average above the line. And as we continually 
mark off the averages for group after group by points, they 
will lie on some curve which tends upward to the right from 
the origin and downwards to the left. (See for example 
Diagram XI.) If, on the other hand, a high value of x went 
with a low value of y, there is a change of sign; the series of 
averages would go down to the right and up to the left. The 
exact method of drawing a line through these points I do not 
propose to discuss very minutely. We could draw a smooth 
line by the methods discussed in the first lecture, or a 
freehand curve. We can either draw a straight line as near 
as possible to the dots, or we can draw a curve. I shall 
not discuss the general shape of that curve; I shall merely 
assume that, from the observation or otherwise, we can draw 
that curve. And since in any series of observations the 
particular averages are liable to slight displacement, in a 
finite number of observations we do not get the most probable 
point with each average, and must smooth the line in the way 
we have discussed. We may assume an equation, $y = f(x)$, 
which gives the average of the y's for the particular values of 
x; that is only giving a general form to the statement, that a 
value of y is connected with a value of x by a determinate 
equation. 
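
The construction of the points through which such a curve is drawn, namely the average of the y's for each assigned interval of x, can be sketched as follows. The function name, the interval width and the pairs of ages are all hypothetical and serve only to illustrate the grouping.

```python
from collections import defaultdict

def averages_by_interval(pairs, width):
    """Average of the y's for each interval of x of the given width."""
    groups = defaultdict(list)
    for x, y in pairs:
        lower = (x // width) * width   # lower bound of the interval holding x
        groups[lower].append(y)
    return {lower: sum(ys) / len(ys) for lower, ys in sorted(groups.items())}

# Hypothetical (wife's age, husband's age) pairs, invented for illustration.
pairs = [(22, 25), (24, 27), (26, 27), (28, 33),
         (31, 32), (34, 38), (37, 41), (39, 39)]
curve = averages_by_interval(pairs, 5)
print(curve)
```

In these invented figures the averages rise steadily with the age of the wife, which is the upward tendency to the right described in the text; with unconnected groups they would scatter about a horizontal line.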

This equation, of course, only gives the position of the 
averages of the selected groups of y's. Every one of these 
groups has its own frequency-curve. If we select again the 
ages of the husbands of those wives whose ages are between 
25 and 30, we can draw a frequency-curve for that group of 
husbands, but the centre of that frequency-curve will no 
longer be at the average age of all the husbands, if there is 
causal connection between the groups; but as the group taken 
is below the average of the wives, the centre of this curve 
will be below the average for the husbands. It is not 
necessary, in general, to make any attempt to draw this 
frequency-curve point by point, but only to take its centre 
and in some cases its modulus. Instead of dealing with 
arithmetic averages, we may equally well use the medians of 
the groups. 

We might take, for example, such a question as this, a very 
old question: Has the price of wheat anything to do with the 
marriage rate ? In such a case as that we plot out the prices 
of wheat in different months or years along the axis of x, 
and put in ordinates showing the average marriage rate when 
the wheat was that particular price, and the direction of this line 
or the form of this curve would give, within certain limits 
dealt with below, the answer to this question, whether there 
was a connection between the two or not. If we do obtain 
from our observations that there is a tendency upwards to 
the right and downwards to the left, or vice versa, we have 
found that there is something common in the system of 
causation which produces the two sets of phenomena. We 
cannot say that the x's are the cause of the y's, nor vice 
versa, but only that the two phenomena are not absolutely 
independent. 




The Coefficient of Correlation. 

We have to find a numerical measure of that dependence. 
If the curve that we obtain is a straight line, we have only to 
find a means of calculating its inclination. Before proceeding 
to this, let us spend a few words on the case when the curve 
is not a straight line. Suppose that we have sufficient 
observations to determine by experiment and observation the 
actual shape of this curve from large groups, we could, 
without applying any further theory whatever, establish the 
connection between the x's and the y's; the curve could be 
plotted out, and given algebraic expression, if possible, and 
then we should be able to say that for a particular value of x 
the most probable value of y was the one obtained from the 
curve. We could have a curve simply from experience, and 
use the experience with similar phenomena at another time. 
For instance, if we had that experience of the length of the 
lives of the children of parents who lived to various ages, we 
should be able from this empirical curve, to say if a man's 
father lived to a certain age then the chances of the life of 
the son are given by a frequency-curve whose centre is 
found from the empirical diagram, and whose shape would 
very likely be known also. In many cases, however, the 
curve of averages is approximately a straight line. Even if 
the approximation is not very exact, it may be useful to 
calculate the inclination of the straight line that lies 
nearest the averages. Let us suppose that we have the 
equation of this line, $y = ax + b$. Consider any observation 
$(x_r, y_r)$; if this observation lay exactly on that line, $y_r$ would 
equal $ax_r + b$. If the observation does not lie on the line, its 
distance from it, measured parallel to the axis of y, is 
$y_r - (ax_r + b)$. To obtain the best values for a and b, which 
are the only unknown quantities, we can proceed* by the 
method of least squares, and make the sum of the squares 
of such quantities as $y_r - (ax_r + b)$ a minimum. The 
differentials of $\sum \{y_r - (ax_r + b)\}^2$ (say $u$) with regard to 
a and b must be zero. 

Thus 
$$\frac{du}{da} = 2a\sum x^2 - 2\sum xy + 2b\sum x = 0,$$ 
$$\frac{du}{db} = 2nb + 2a\sum x - 2\sum y = 0.$$ 

* See below p. 73. 


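Choose the axes so that both the x's and the y's are 
measured from their averages, then $\sum x = 0 = \sum y$; the 
equations give us $b = 0$ and $a = \dfrac{\sum xy}{\sum x^2}$; the line required passes 
through the origin, and its equation is $y = \dfrac{\sum xy}{\sum x^2}\, x$. Let $\sigma_1$, $\sigma_2$ 
be the standard deviations of the groups of x's and of y's, so 
that $n\sigma_1^2 = \sum x^2$ and $n\sigma_2^2 = \sum y^2$, and let $r = \dfrac{\sum xy}{n\sigma_1\sigma_2}$; then the above 
equation becomes 

$$y = r\,\frac{\sigma_2}{\sigma_1}\,x,$$ 

that is, 

$$\frac{y}{\sigma_2} = r\,\frac{x}{\sigma_1}.$$ 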
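
The derivation above can be followed numerically with a short Python sketch. It computes the slope $a = \sum xy / \sum x^2$ with both variables measured from their averages, the standard deviations, and $r = \sum xy / (n\sigma_1\sigma_2)$, so that the relation $a = r\sigma_2/\sigma_1$ can be checked. The function name and the paired figures are hypothetical.

```python
import math

def line_and_correlation(xs, ys):
    """Least-squares line y = a*x + b and the coefficient r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]   # x's measured from their average
    dy = [y - mean_y for y in ys]   # y's measured from their average
    sxy = sum(u * v for u, v in zip(dx, dy))
    sxx = sum(u * u for u in dx)
    syy = sum(v * v for v in dy)
    sigma1, sigma2 = math.sqrt(sxx / n), math.sqrt(syy / n)
    a = sxy / sxx                    # slope: sum(xy) / sum(x^2)
    b = mean_y - a * mean_x          # intercept restored to the original axes
    r = sxy / (n * sigma1 * sigma2)
    return a, b, r, sigma1, sigma2

# Hypothetical paired measurements, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r, s1, s2 = line_and_correlation(xs, ys)
print(a, b, round(r, 4))
```

Measuring from the averages makes the intercept vanish, as in the text; the value of b returned here merely shifts the same line back to the original origin.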

In order to make r symmetrical, it has been necessary to 
divide by $\sigma_1$ and $\sigma_2$, that is, to measure x and y by their 
standard deviations. It is a very natural thing to do. Before 
we can get any numerical comparison, we must reduce them 
to some common measure, and a common unit which we can 
very reasonably adopt is the standard deviation for each of the 
two things. If we are dealing with the question I suggested 
just now (the marriage rate and the price of wheat) we 
cannot compare shillings with a rate per thousand, but we can 
compare a ratio of the number of shillings to a standard 
number of shillings, with the ratio of the rate per thousand to 
a standard rate per thousand. We are then comparing 
absolute instead of concrete quantities. We should get 
similar equations if we used the modulus instead of the 
standard deviation, or the probable errors, or the mean 
deviations. For rapid work we could replace the $\sigma_1$ and $\sigma_2$ 
by the probable errors, which are proportional to the standard 
deviations in curves which approximate to the curves of error. 
It is to be noticed that we can express the quantity r in the 
following form: r is the average of such products as 
$\dfrac{x}{\sigma_1} \cdot \dfrac{y}{\sigma_2}$. r is called the coefficient of correlation. It is not 
difficult to show by pure algebra that the quantity r so 
determined must lie between +1 and −1;† and that r equals 
+1, only if the ratio of every x to its corresponding y is 


* The last few paragraphs are substantially the same as those given by 
Mr. Yule in the Journal of the Royal Statistical Society, 1897, p. 817 seq. 

† See Elements of Statistics, p. 319. 



identically the same as the ratio of every other x to its
corresponding y, that the ratio y/x is constant, and

If the ratio is constant and r becomes

and an increase of x corresponds to a diminution in y. r
is always between +1 and −1, and between these limits
there is a scale of correlation. For instance, we can say that
the correlation between two sets of phenomena is .6
Of course, when one is first introduced to a new scale of this
sort the numbers in the scale convey no meaning, and it is a
matter of experience to attach the right value to the
magnitudes in the scale. Perfect correlation can be understood
from the statement that groups are perfectly correlated when the
deviation of a member of one always equals the deviation from
the average of the corresponding member of the other
multiplied by an assigned constant. If the two series,
marriage and wheat prices, were perfectly (negatively)
correlated, you would be able to establish some such rule
as this: An increase of .1 in the marriage rate is always
found with a diminution of 6d. in the price of wheat. Of
course, such a rigid relation is never obtained unless there is
some physical cause binding the two things together. As the
ratio of corresponding pairs tends to constancy, the correlation
becomes more and more perfect. That must be regarded as the
definition of correlation.
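
The statement that r is the average of the products of the deviations, each divided by its standard deviation, can be tried numerically. The following is only a sketch: the marriage rates and wheat prices are invented for the purpose, and the function name `correlation` is not taken from the lectures. When the ratio of every deviation to its fellow is constant, the coefficient should come out at +1 or −1 according to the sign of that ratio.

```python
import math

def correlation(xs, ys):
    """Average of the products (x/sigma_1)*(y/sigma_2) over all pairs,
    x and y being deviations from their respective averages."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sigma_1 = math.sqrt(sum(d * d for d in dx) / n)
    sigma_2 = math.sqrt(sum(d * d for d in dy) / n)
    return sum((a / sigma_1) * (b / sigma_2) for a, b in zip(dx, dy)) / n

# Hypothetical figures: each rise of .1 in the marriage rate goes with
# a fall of 6 (pence) in the price of wheat, so the ratio y/x is constant.
marriage_rate = [16.0, 16.1, 16.2, 16.3, 16.4]
wheat_price = [600.0 - 60.0 * (m - 16.0) for m in marriage_rate]

r_negative = correlation(marriage_rate, wheat_price)
r_positive = correlation(marriage_rate, [2.0 * m + 1.0 for m in marriage_rate])
print(round(r_negative, 6), round(r_positive, 6))
```

Any departure from a constant ratio in the invented figures would bring the result inside the limits.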

Now consider the sum of the products of x and y;
let us write X for $x/\sigma_1$, and Y for $y/\sigma_2$.

If there were no correlation, if we selected the values of
Y which corresponded with a particular small range of
values of X, we should be likely to find a negative value to
neutralize each positive value of Y, and the products arising
from that range of X's would tend to zero, and the greater
the number of terms the less the distance of their average
from zero. But directly there is any bias towards
the positive value of Y for this particular range of X's, as
we increase the terms we may still get negative terms here
and there, but on the whole we shall get positive terms; and
so on, all the way up the scale of X's. When there is
correlation it is clear that the sum of the products tends to be
greater than where there is none. Thus it seems




from first principles that the quantity r thus calculated will
make a good measure of correlation.

There is an important caution to be given in the use of
this formula. If, from two series of phenomena which were
absolutely unconnected, we took a limited number of examples,
say a thousand, and worked out the value of r, we should not
obtain exactly zero, or rather the chances are very much
against obtaining exactly zero, even if there was no correlation;
and if we took a very small number of examples the chances
are very much against obtaining anything near zero. As we
increase the number of samples, if there is no correlation,
the coefficient will tend more and more nearly to zero. What
we require before we can use the coefficient is some criterion
to enable us to know whether the formula is significant,
or whether the actual number might have arisen if there had
been no correlation whatever. Such a criterion is given below,
on p. 88.
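
The caution may be illustrated by experiment. The sketch below draws pairs of wholly unconnected numbers and works out r for small and for large samples; the sample sizes, the number of trials, the use of normally distributed random numbers and the name `typical_abs_r` are all assumptions made for the illustration, and it is not the criterion referred to above.

```python
import math
import random

def correlation(xs, ys):
    """Coefficient of correlation: average product of standardized deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sigma_1 = math.sqrt(sum(d * d for d in dx) / n)
    sigma_2 = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma_1 * sigma_2)

def typical_abs_r(n, trials, rng):
    """Average size of r, regardless of sign, for unconnected samples of n pairs."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]  # drawn independently of xs
        total += abs(correlation(xs, ys))
    return total / trials

rng = random.Random(1903)  # fixed seed so the experiment can be repeated
small_sample = typical_abs_r(10, 200, rng)
large_sample = typical_abs_r(1000, 200, rng)
print(small_sample, large_sample)
```

With few pairs the coefficient commonly stands well away from zero although nothing connects the two sets; with a thousand pairs it lies close to zero.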


Ages of Wives: 15 to 20, 20 to 25, 25 to 30, 30 to 35, 35 to 40, 40 to 45, 45 to 50, 50 to 55, 55 to 60, 60 to 65, 65 to 70, 70 to 75



o 


30 

SC 

<1 


20—25 2 21 

25—30 .. 23 

SO—35 .. 4 

35—40 .. 1 

40—45 

45-50 .. .. 

50—65 .. .. 


55—60 

60—65 

65—70 

70—75 

75—80 

80— 


6 . 

49 9 1 .. .. 

31 49 9 1 

7 29 43 9 1 

2 7 25 36 7 

1 2 7 22 31 

..1 2 0 18 

.... 1 25 

. 1 2 

. 1 


1 

6 1 
23 5 

13 17 

4 9 

1 3 

.. 1 


1 

4 

11 

6 

o 


1 

2 

6 

3 

1 


1 

3 

1 


Husbands, No.:                 2    49    96    97    88    77    65    48    36    24    13     5

Husbands, Median age (years): 22.8  25.8  29.3  34.0  38.9  43.9  48.8  53.7  58.6  63.2  68.0  72.1

Average age of Husbands, 42.16 years. Standard deviation, 12.6 years.


above 
> 80 


SO 


Wives 


No. 


Median 

Age 


23-1 yean 
26*8 
31*2 
35*9 
40*7 
45*6 
50*2 


.. .. 43 

29 

.. .. IS 

1 -. 10 

1 .. 3 


55*05 

59*4 

63*75 

68*05 

71*60 

76*17 


Average age of Wives, 40.11 years. Standard deviation, 12.1 years.


Coefficient of correlation is .96 approximately.


The numbers given are in every case the nearest thousands. 


A numerical example I have prepared will put the
calculation in a clearer light. The table here given shows
the numbers (to the nearest thousand) of the wives and
husbands in the County of York in 1901 at various ages,

















in periods of five years. For example, if you take
the husbands' ages from 40 to 45 and look along the line, you
will see that there are two wives between 25 and 30, 7
between 30 and 35, 25 between 35 and 40, and so on. It is
not practicable to deal with 610,000 cases, and I have
therefore dealt with the thousands only, and approximated
throughout the calculation. The fact that the numbers run
diagonally down the table as they do shows at once that there is
correlation. I have taken a case where the correlation is nearly
perfect. If we had a table where the correlation was very small
we should find the numbers distributed in random fashion
over the table. In such a list of figures as this it is not
practicable to take the arithmetic average. It is easier and
as accurate to take the medians. I have approximated to the
medians for all the groups, both horizontal and vertical, by
the methods already explained. To take a particular example,
consider again the husbands who are between 40 and 45 years
of age. If you look along the list you will find there are in
all 78, and that the median age is 40.7 years. Or if you take
a vertical column, if you choose those wives who are between
40 and 45, and look vertically downwards, you will find that
there are in all 77 of them, and that the median age of their
husbands was 43.9.
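
One way of approximating to such a median from grouped figures may be sketched as follows. It assumes that the ages within each five-year group are spread evenly, so that the median can be placed by simple proportion; whether this is exactly the method explained in the earlier lecture is an assumption, and the name `grouped_median` is introduced here only for illustration. The counts are those of the husbands of wives aged 25 to 30, in thousands, as used in the worked example of the lecture.

```python
def grouped_median(lower_limits, width, counts):
    """Median of grouped data, placed by proportion within the group
    that contains the middle observation."""
    total = sum(counts)
    half = total / 2
    cumulative = 0
    for lower, count in zip(lower_limits, counts):
        if count > 0 and cumulative + count >= half:
            return lower + width * (half - cumulative) / count
        cumulative += count
    raise ValueError("no observations")

# Husbands of wives aged 25 to 30 (thousands), in five-year groups of age.
husband_lower_limits = [20, 25, 30, 35, 40, 45]
counts = [6, 49, 31, 7, 2, 1]

median_age = grouped_median(husband_lower_limits, 5, counts)
print(round(median_age, 1))
```

The result may be compared with the third entry in the row of median ages of husbands in the table.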

The diagrams show the medians graphically. In the first
the ages of wives are measured horizontally, those of husbands
vertically. Above the middle point of each five-year period
is placed a dot indicating the median age of husbands whose
wives' ages come in that period. Thus, looking upwards from
the position of 37½ years, the middle age of the group of wives
between 35 and 40, you will find that the dot indicating the
median age of the husbands is placed at the 38.9 years. You
see that the points so obtained lie very nearly in a straight
line. At the top and at the bottom the line becomes a
little bit curved, for the influences of the lower and upper
limits of ages make themselves felt. If we tried
normal distribution for the group of wives who are
we should be getting husbands at 13 and 14 years of age; and
at the other end of the scales we should have got husbands
at ages at which there are no people alive. The fact that the
scale is limited at both ends is the cause of the deflection of
that curve from the straight line. We have now to measure
the inclination of that line. In the case I have taken








the correlation is, there is no difficulty in measuring
the line, because if a straight line is drawn through three or
four of those points it passes very near the others. In
other cases, it is not so obvious which straight line is to be
drawn; and then we can proceed by the method of least
squares already taken, or you can proceed by the following
practical method which yields good results: Mark on the diagram
lines horizontally and vertically through the averages of the
two groups, and rotate a ruler through their point of
intersection until the same number of dots is found on the one
side of it as on the other. It will be found that that method
gives a definite position of the line which passes very near the
points; it is a purely empirical way; but as the coefficient
of correlation need generally not be calculated with
great minuteness, it will in general be sufficiently correct.
It is often absurd in cases of probability to work out
the results with very great accuracy. The line is not
drawn in the diagram above, because it would have
obscured the dots; but underneath is given the tangent
of the inclination to the horizontal of the line which would
satisfy the conditions, the tangent of this angle is .97.
The second diagram is constructed in a similar way, for the
median ages of wives, whose husbands are in a
group; the tangent of the inclination of the line through
the points is now .92. The average age of the husbands is
42.16 years, with standard deviation 12.6 years; the average
age of the wives 40.11 years, with standard deviation
12.1 years. The statement we have now obtained
is of this sort: If we are dealing with a man whose
is h, in excess of the average, and we wish to
the age of his wife; the value of w in the equation
42.16 = .92(w − 40.11) is nearly the most probable value of
her age. That comes at once from the geometry of the second
diagram. From the first diagram we obtain similarly:
Given the age of a woman as being w, so that the deviation
from the average is w − 40.11, then the median age of the
husband group is 42.24 + .97(w − 40.11). We shall probably
also need to know the curve of frequency for each of
groups. Unless there is a reason to the contrary, I think in
general that we may assume that the curve of frequency for a
selected group is similar to the curve of frequency for the
whole group from which it was selected. So that we




calculate the standard deviation for this particular curve of
frequency when you know the standard deviation for the
whole group. That is to say, we can ascertain the chance
that the age of the husband of a particular woman is
any assigned number of years above or below the age here
selected. In the little table given above we can actually find
these small curves of frequency; for instance, in the ages of
wives 30 to 35 years of age the curve of frequency for the
husbands goes as follows: 9, 49, 29, 7, 2, 1.

The above is the graphic way of working out the question.
We have now to show its relation to the formula for r, the
coefficient of correlation. The quantity $r\sigma_2/\sigma_1$ in the equation
$y = r\frac{\sigma_2}{\sigma_1}x$ is the quantity evaluated by the diagram as .97;
that is the tangent of the inclination of the line to the axis of x,
where ages of wives and husbands are measured on the axes
of x and y respectively, and $\sigma_1$, $\sigma_2$ are the standard deviations
for wives and husbands. If I had reduced all the measurements
to the standard deviation beforehand, r, the coefficient
of correlation, would have been the tangent of the inclination
of the line. The question which is the easier to work decides
which of the two methods you adopt. If you work it as
I have done, with the same scale of years vertically and
horizontally, you would have to say that $r = .97 \times \frac{\sigma_1}{\sigma_2}$, and
from the lower diagram that $r = .92 \times \frac{\sigma_2}{\sigma_1}$, whence r is the
geometrical mean between .97 and .92, between the tangents
of the inclinations of the line calculated on the two different
hypotheses, namely, .945; and $\frac{\sigma_1}{\sigma_2} = \sqrt{\frac{.92}{.97}}$.
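
The arithmetic of the graphic method is short enough to verify directly. In the sketch below the two tangents are those read from the diagrams; the variable names are introduced here for convenience only.

```python
import math

slope_husbands_on_wives = 0.97   # first diagram: r * sigma_2 / sigma_1
slope_wives_on_husbands = 0.92   # second diagram: r * sigma_1 / sigma_2

# Multiplying the two slopes removes the standard deviations and leaves r squared.
r_graphic = math.sqrt(slope_husbands_on_wives * slope_wives_on_husbands)

# Dividing one by the other removes r and leaves (sigma_1 / sigma_2) squared.
ratio_sigma_1_to_sigma_2 = math.sqrt(slope_wives_on_husbands / slope_husbands_on_wives)

print(round(r_graphic, 3), round(ratio_sigma_1_to_sigma_2, 3))
```

The first figure is the geometrical mean named in the text; the second is the ratio of the standard deviation of the wives to that of the husbands as the diagrams alone would give it.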


Now let us proceed to calculate r by the formula $\frac{\Sigma xy}{n\sigma_1\sigma_2}$.

It is, of course, a long business, and I shall not give the work 
completely; I shall only indicate the way in which it was done. 
The problem was to find that product for 610,000 pairs, which, 
of course, is a prohibitive piece of work, and cannot be done 
accurately, because the ages are not given except in 5 yearly 
limits. We proceed by approximation. First of all, I neglect 
all the numbers below 1,000; secondly, I assume that the 
numbers left are at the middle of their respective groups. 
Then I deal with the 60 or 70 numbers in the table on p. 67 in 
the following way. Select a group, e.g., wives whose ages are 



25 to 30; the middle, 27½, is 12.6 years below the average
of all wives; express this and other deviations in terms of the
standard deviation, 12.12; 12.6 ÷ 12.12 = 1.04. That is the
term to be applied throughout this group of husbands. Go
through a similar process for the y's. This 6 is at the middle
of the group 20-25 years, namely, 22½, which is 19⅔ years
below the average age of all husbands, and that in terms of
the standard deviations is about 1.6; work out the other
deviations, which are in arithmetic progression, in the
same way. Then multiply the numbers in the group, 6, 49, 31,
7, 2, 1 each by its deviation, add, and multiply the sum by the
deviation 1.04 for the group.

Example:

Age of Wives, 25-30
Distance from Average −1.04 of Standard Deviation

Age of        Distance from      No.      Product
Husbands      Average
20-25         −1.6                 6      − 9.6
25-30         −1.2                49      −58.8
30-35         − .8                31      −24.8
35-40         − .4                 7      − 2.8
40-45         + .03                2      +  .1
45-50         + .4                 1      +  .4

                                     Sum  −96.5

Corresponding part of $\Sigma \frac{x}{\sigma_1}\cdot\frac{y}{\sigma_2}$ is −1.04 of −96.5 = +103.6
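
The working for a single group can be set out mechanically. The sketch below takes the rounded deviations and the numbers exactly as they stand in the example above; the variable names are introduced for illustration. The totals it produces need not agree to the last figure with those printed, which were worked approximately.

```python
# Deviation of the wives' group 25-30 from the average of all wives,
# in terms of the standard deviation (from the example).
wives_deviation = -1.04

# Husbands' groups 20-25 ... 45-50: deviation from the husbands' average in
# terms of their standard deviation, and the number (thousands) in each group.
husband_deviations = [-1.6, -1.2, -0.8, -0.4, 0.03, 0.4]
numbers = [6, 49, 31, 7, 2, 1]

# Multiply each number by its deviation and add.
products = [n * d for n, d in zip(numbers, husband_deviations)]
weighted_sum = sum(products)

# Multiply the sum by the deviation of the wives' group.
group_term = wives_deviation * weighted_sum

print([round(p, 2) for p in products], round(weighted_sum, 2), round(group_term, 2))
```

Repeating this for every group of wives, adding the terms, and dividing by n completes the calculation described.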


Some of the resulting terms will be negative unless the
correlation is considerable. Add these terms, and divide the
sum by n (in this case 602), and the coefficient of correlation
.96 is obtained. In the method I suggest using we do not
deal with any large numbers at all. The number is not far
from the geometric mean of .945 found by the graphic method
above. Also $\frac{\sigma_1}{\sigma_2}$ = .96 (instead of .97 as reckoned above),
$r \times \frac{\sigma_1}{\sigma_2}$ = .92 (the same as above), $r \times \frac{\sigma_2}{\sigma_1}$ = 1.0 (instead of .97).

Justification of the Formula for r.

In the method of finding the formula for r on page
used the method of least squares without examining its
suitability. I will now give reasons, which have not, so far


as I know, been previously offered, in favour of this method.
If the figures we are dealing with belong to the normal curve
of error, there is no difficulty. If their curve of frequency
has any other form, still the averages of selected groups,
represented by the dots in Diagrams XI and XII, are governed
by normal curves of error (see page 56). Let $y_1, y_2, \ldots y_m$
be the averages of groups of y's containing respectively
$h_1, h_2, \ldots h_m$ items, whose x values are $x_1, x_2, \ldots x_m$, so that
the $h_r$ x's which, in the grouping of x's adopted, are in a small
group whose centre is $x_r$, have y-pairs whose average is $y_r$.
Let $y = ax + b$ be a line which contains the values of y from
which the observed $y_1, y_2, \ldots$ are deviations. Then
$(y_r - ax_r - b)$ is a quantity whose frequency-curve is
$\frac{1}{c\sqrt{\pi}}e^{-z^2/c^2}$ (writing z for the deviation), where c the modulus is inversely proportional
to $\sqrt{h_r}$, $h_r$ being the number in the $x_r$, $y_r$ group. The
probability of such deviations occurring together is (as on
page 46) a maximum, when $\Sigma h_r (y_r - ax_r - b)^2$ is a minimum.
Equating the partial differentials of this sum with reference
to a and b to zero, and remembering that $\Sigma h_r y_r = 0 = \Sigma h_r x_r$
if the deviations are measured from the general average, we
have, as on page 65, $b = 0$, and $\Sigma h_r x_r (y_r - ax_r) = 0$. Hence,
$a = \Sigma h_r y_r x_r \div \Sigma h_r x_r^2 = \Sigma xy \div n\sigma_1^2$, where the summation extends
over all the pairs. Then, as before, $r = \frac{a\sigma_1}{\sigma_2} = \frac{\Sigma xy}{n\sigma_1\sigma_2}$.
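
The identity on which this argument turns, that the line fitted to the group averages with weights equal to the numbers in the groups has the same slope as $\Sigma xy \div n\sigma_1^2$ taken over all the pairs, can be checked on a small invented example. The figures below are hypothetical, and every x in a group is supposed to stand at the centre of its group.

```python
import math

# Hypothetical grouped pairs: (x shared by the group, y values observed in it).
groups = [
    (1.0, [2.0, 3.0, 2.5]),
    (2.0, [3.5, 4.5]),
    (3.0, [5.0, 6.0, 5.5, 6.5]),
    (4.0, [8.0]),
]

pairs = [(x, y) for x, ys in groups for y in ys]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

# Slope from the group averages, each weighted by the number h_r in the group,
# with deviations measured from the general averages.
numerator = 0.0
denominator = 0.0
for x, ys in groups:
    h_r = len(ys)
    x_r = x - mean_x
    y_r = sum(ys) / h_r - mean_y
    numerator += h_r * x_r * y_r
    denominator += h_r * x_r * x_r
a_from_groups = numerator / denominator

# Slope from all the pairs taken singly: sum of xy divided by n * sigma_1 squared.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
sigma_1 = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / n)
sigma_2 = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / n)
a_from_pairs = sum_xy / (n * sigma_1 ** 2)

r_from_slope = a_from_pairs * sigma_1 / sigma_2
r_direct = sum_xy / (n * sigma_1 * sigma_2)
print(a_from_groups, a_from_pairs, r_from_slope, r_direct)
```

The two slopes agree, and so do the two expressions for r, whatever figures are put into the groups.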


Consideration of the nature of the formula will, I think, lead 
to the conclusion, that the coefficient of correlation calculated 
by the formula is a good measurement of correlation, whatever 
curves of frequency you are dealing with; and it is surprising
how very rapidly a small extent of correlation makes itself 
felt, even when you deal with quite a few examples. If n is 
only 20, you will soon find whether there is correlation or 
not by this formula. If you select groups where there is no 
correlation the criterion, discussed below, shows that the 
correlation is not significant; but directly there is likely to be 
correlation between the groups, this formula for r shows it. 
The coefficient of correlation can be used then in a very large 
region of cases in which it is required to test the connection 
between two series of phenomena. In particular, it can be 
used to decide whether two series of phenomena are entirely 
unconnected or not, which subject necessitates a preliminary 
treatment of the nature of series. 



MEASUREMENT OF SERIES. 


SIXTH LECTURE. 


Series.

I PROPOSE to deal in this lecture, first of all, with series 
in general, and then with the comparison of and correlation 
between two series. By a series I understand a list of 
numerical events recorded at regular intervals, for example, 
recorded once every year. In representing a series by a 
diagram we measure time on the horizontal axis, and 
dividing it up into years, we erect an ordinate at the point 
corresponding to each year, to represent on a suitable scale
the magnitude at that particular year. The question whether 
we should represent these magnitudes by dots or lines or 
rectangles is important, but it is decided on the principles 
discussed when we were dealing with the representations of 
groups, and we need spend no more time on the analysis now. 
Perhaps the most natural way of representing such series is to 
erect a series of rectangles whose areas are proportional to 
the successive magnitudes; but if we leave the diagram in 
that form it will not be very clear, it will be very ugly, and 
certainly this is not a neat way of finishing the representation. 
The next step is to draw a continuous line to replace the 
rectangles; the commonest way of doing this is, to mark the 
middle points of the tops of the rectangles, and join those 
points by straight lines; but this method is erroneous, for the 
same reason that it was erroneous in the representation of a 
group. We need to draw a continuous line so that the areas 





contained in the rectangles in the first place, and by the
curved trapeziums in the second, shall be equal in every case;
but this more correct curve is practically coincident with the
erroneous straight lines; it makes very little difference in
practice which of the two we draw; they give certainly the
same optical impression. If, however, we do replace the
rectangles by a continuous line we are making an assumption
which is sometimes justified, but sometimes not; by the fact
of drawing a continuous line we give the impression that the
event represented is continually taking place. This is correct
in representations of births, deaths, and marriages, and it is
partly correct in representing imports and exports by curves;
but it is not correct in the representation of events which
only occur once each year. These are details which are easily
analysed.

Classification. 

The series, or the curves which represent them, can be
divided into three main classes: periodic curves, symptomatic
curves, and others; or instead of "others," we may say curves
with random fluctuations. Periodic curves are those where
similar fluctuations recur at equal intervals of time, as the
annual fluctuation of temperature recorded month by month.
Symptomatic curves are those which have a definite tendency up
or down, a "symptom," though short periods may obscure it,
as the death rate since 1870. A curve, which is neither
periodic nor symptomatic, may often be regarded as having
random fluctuations about a stationary average, as a curve
representing the annual averages of any meteorological
phenomena, such as average temperature year by year. In
the Diagram XIII* all four curves are symptomatic; the first
three are downwards, and the last upwards for the first 30
years and then nearly level. The series represented in
Diagram XIV has apparently random fluctuations. These
curves are not periodic in any strict sense.

Periodic Curves. 

The first thing to discuss is, how to disentangle the period 
from the symptom when a periodic curve is also symptomatic, 
or how to measure the period if the curve is not symptomatic. 
There is not space to discuss the matter completely, and I want 
rather to indicate the methods, and leave their consideration 

* See p. 81.





to the reader. A curve often suggests two things: first, that there
is a regular period, and, secondly, that there is a movement
apart from the period. Assume that we are dealing with
monthly observations and an annual period. To obtain the
movement apart from the period, take the averages of the
12 months of each year and mark them on the diagram;
these points would show the average rate for the year, when
the readings of the vertical scale have been adjusted. But
there is something arbitrary in beginning the year at the 1st
of January. The deaths, births, and marriages, and any other
figures we deal with are probably independent of that
particular beginning of the year, and if we make comparisons it
may be better to take other periods to start with; for instance,
the fiscal year begins on April the 5th. We want a continuous
representation, which we can obtain as follows: First take
the average from January 1st to December 31st. Then the
average from the 1st of February to January 31st, and so on
until we get 12 dots every year. It is clear that the curve
through these points cannot have any sudden fluctuations; the
curve so obtained shows the symptom when the period is
eliminated. The theory underlying this method is quite
simple. If we take any particular 12 months, we shall
include the whole influence of the period, the excess in one
part and the defect in another, and if we average
them we shall probably get the number which would
have occurred if there had been no period, and if
the flow had been regular. It is approximate only, because
the various small fluctuations will affect the average, and it
can be improved by smoothing the curve. If the series is not
symptomatic the resulting smooth curve should be a horizontal
straight line.
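
The elimination of an annual period by averaging each successive run of twelve months may be sketched thus. The series is artificial, a steady rise with a regular annual swing laid upon it, and the function name `moving_average` is introduced for illustration; with real figures the small fluctuations mentioned above would remain.

```python
import math

def moving_average(values, window):
    """Average of every successive run of `window` values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

months = range(60)  # five years of monthly observations (artificial)
symptom = [20.0 + 0.05 * m for m in months]                       # steady movement
period = [3.0 * math.sin(2 * math.pi * m / 12) for m in months]   # annual swing
series = [s + p for s, p in zip(symptom, period)]

# January to December, then February to January, and so on.
smoothed = moving_average(series, 12)

# Each average belongs to the middle of its twelve months, 5.5 months after
# the first of them; there the steady movement stands at 20 + 0.05 * (i + 5.5).
largest_gap = max(abs(smoothed[i] - (20.0 + 0.05 * (i + 5.5)))
                  for i in range(len(smoothed)))
print(len(smoothed), largest_gap)
```

Every run of twelve months takes in the whole of the swing, excess and defect together, so that only the steady movement is left.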

Now, in order to measure the period as apart from the
symptom, the only method is to write down the rates for the
50 Januaries which we may be dealing with, and take the
arithmetical average, the mode, or the median of these; to
repeat the process with the Februaries, and so on; and then
to represent the successive averages for the 12 months by a
separate curve, which is best drawn with a base line through
the general average of all the data. We thus get such a
curve as that given by the graph of $y = \sin x$, from 0° to 360°.
The justification of the method is simple. In the 50 Januaries
we include one January from each part in the symptomatic
curve. All the excesses due to the symptomatic tendency will
be counter-balanced by the defects, or will tend to be counter-
balanced by the defects, due also to symptomatic tendency.
They will only tend to be counter-balanced; for if we take
the 50 Januaries we include among them some extraordinary
months, and some months whose deviation from the annual
average is quite small. The accuracy with which we may
expect to get the true January reading is proportional to the
square root of the number of times taken, from the theory of
averages discussed above. In carrying out the method, we
implicitly assume that the causes which decide the symptom
and the causes which decide the period are independent,
while generally they are not independent. If there is an
increasing death rate or an increasing want of employment
at the same time that the winter is especially severe the one
will accentuate the other. It is very easy to see how the
result may be affected. Suppose some industrial disaster
throws a great proportion out of work in August in one year,
so as to increase the percentage of unemployed, we will say
to 50, then when taking the average for ten years, that
figure alone gives a rate of 5 per cent. in August, whereas
the excess had nothing to do with the fact that August was
the month concerned. If you take a sufficient number of
years, however, those things will tend to equalize one
another, and if we use the median instead of the arithmetic
average extraordinary occurrences have little effect. For
this reason it is best to estimate the period from the medians.
In the end we shall not get a smooth curve for our averages,
and may have to smooth that by a trigonometrical function,
or by some other method.
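
The advantage of the median in such a case can be shown on invented figures. In the sketch below ten years of monthly percentages unemployed are supposed, alike in every year except that one August stands at 50; the base figures and the function name `period_by_medians` are assumptions made for the illustration.

```python
from statistics import mean, median

def period_by_medians(years):
    """For each calendar month, the median of that month over all the years."""
    return [median(year[m] for year in years) for m in range(12)]

# Hypothetical percentages unemployed, January to December, for an ordinary year.
ordinary_year = [4.0, 3.8, 3.5, 3.0, 2.6, 2.2, 2.0, 2.0, 2.3, 2.8, 3.4, 3.9]

years = [list(ordinary_year) for _ in range(10)]
years[4][7] = 50.0  # the industrial disaster: August of the fifth year

august_mean = mean(year[7] for year in years)
medians = period_by_medians(years)

print(august_mean, medians[7])
```

The single extraordinary August adds nearly 5 to the arithmetic average of the ten Augusts, while the median for August is left where it stood.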

Symptomatic Series. 

We will now discuss the symptomatic curves; the top 
curve in Diagram XIII (male death-rate) will do as well 
as any as an illustration, for the method of dealing with this 
curve applies to a very great number of such curves. All 
statistics representing sociological phenomena that I have 
had experience of are symptomatic. Perhaps in very rare 
cases you will find no symptom, but in general there is a 
symptom; however remotely connected the figures are with 
the general progress of civilization, you will find there is 
some symptom up or down, or alternately up and down. In 


general we may assume a symptom in all figures relating to
human society. In dealing with such curves, we seldom
want to examine them in detail for a short period;

we are more concerned with the symptom, especially in
forecasting events. In curve A in Diagram XIII there are
considerable and rapid fluctuations, but there is also clear
optical evidence of a fall in the rate beginning between
and 1870. The causes which produced the actual size of each
ordinate are, of course, very many, and it is impossible to
draw the line between those which tend to make a
permanent change, and those which tend to make a
temporary change. It is a question of degree and not of
character, and for that reason alone it is impossible to give
any theoretic solution for distinguishing the symptom from
the small fluctuations, just as it is impossible to give a
general solution to the interpolation problem. We have
to find an empirical solution, one that satisfies our immediate
needs. It might appear best to draw a straight line which
on the whole shall differ from the observations as little as
possible, and which could be determined by the method of
least squares; this would assume a symptomatic tendency of
equal increments or decrements in successive years. Or we
might assume a parabolic curve or logarithmic curve; a
recent American writer has assumed that a certain series can
be represented by y=:Jcx^, the compound interest ( 
But I think in general there is no reason to assume any
definite algebraic law. The solution I should suggest (it is
a commonplace one) is similar to that I have just suggested for
the removal of the period. It is most easily understood from
an example.

The figures in the following table are from the Registrar-
General's Returns, or are calculated from the Statistical
Abstract.


Imports per Heai*. 


MAEEIAf.E Raie 
PEK 


llkZ- ,.!■ 

Mai.e.- I i.v ; ’A*” 


3*30 ; 


17*2 


21*7 


3*15 


17*2 


23*9 


3*21 

-*13 

15*8 

— •7 

25*5 

t 1-4 

2*91 

-*44 

15*9 

-*6 

23*8 

- -3 

3*52 

-•03 

16*2 

-•3 

25*8 

• rl-9 

3*97 

-^•19 

17*2 

4*4 

21*4 

- 2*0 

4*14 

-*16 

17*2 

■ 

22*8 

- *6 

4*35 

— *35 

17*4 

•0 

23*2 

T *1 

5*51 

■ f*58 

17*9 

4*7 

^ 23*8 

T *3 

5*51 

- f*17 

17*2 

4*1 

1 24*4 

41*2 

5*16 

-*64 

16*2 

— *7 

23*5 

4 *4 

6*16 

- r*30 

16*7 

4-'2 

21*3 

-1*8 

6*66 

4 - *65 

16*5 

*0 

22*6 

- *3 

5*80 . 

-•64 

16 * 

— *7 

23*9 

^ 1*3 

6*26 

— *45 

17 * 

4 *4 

23*3 

4 '4 

7*32 

- f*40 

17*1 , 

4*6 

22*1 . 

- *8 

7*50 

-T ’Oo 

16*3 

-•4 

22*7 : 

— *2 

7*72 

-*33 

16*1 ; 

-*6 

22*4 

’ - *8 

8*45 ' 

■4 * 0t > 

16*8 ■ 

*0 

24*1 ^ 

i 4 *4 

9*26 , 

4*40 

17*2 ; 

4*2 

24*9 

+ *8 

9*06 ! 

~*06 

17*5 1 

4*4 

24*5 

4 *3 

9*80 i 

4 * 4o 

17*5 ; 

4 *5 

24*6 

! 4 *6 

9*05 

-*36 

16*5 i 

— *2 

23 * 

^ - *8 

9*60 

4*06 

16-1 * 

-*3 

23*1 

, - *8 

9*54 

-*30 

15*9 I 

-•4 

23*6 

( -u 

9*70 

-•35 

16*1 1 

-•3 

24*2 

4 *7 

10*49 

-•15 

16*7 ! 

•0 

23*9 

! 4 -6 

11*13 

. 4*28 

17*4 j 

4*4 

22*6 

1 ~ *7 

11*54 

4*35 

17*6 ; 

4*5 

22*4 

1 - *9 

11*39 

' 4*04 

17 * 1 

•0 

23*6 

4 *6 

11*39 

-*(^ 

16*7 ; 

•0 : 

24*1 

1 ^ 1*3 

11*30 

-•04 

16*5 

4*3 ! 

22*3 

1 - *6 

11*75 

4*57 

15*7 i 

•0 

21*7 

- *9 

10*87 

-•41 

15-2 ’■ 

-•1 

22*9 

4 *8 

10*59 

-•70 

14-4 1 

— *7 

22 * 

4 *3 

11*88 > 

4*59 

14-9 i 

-*1 

21*8 

4 *3 

11*37 1 

-•15 

15*1 1 

-0 

20 * 

- 1*1 

11*73 

4*14 

15*5 

4*3 

20-7 

- *1 

12*04 ? 

4*77 

15*5 i 

4*4 

20*8 

1 4 *3 

10*92 ! 

*0 

15*1 1 

4*1 

20*9 

4 *2 

10*30 

-•26 

14*5 

— ’2 

20-3 

- *3 

9*63 i 

-62 

14*2 ^ 

-•3 

20*6 

4 *4 

9*90 

-*47 

14*4 

-•1 

20-2 

4 *3 

10*51 

-*04 

14*4 

-•3 

19*2 

- *8 

11*50 

4*57 

15 * 

1 -0 

19*3 

- *9 

11*22 

4*05 

15*5 

4*3 

20*8 

4 *6 

11*52 

4*34 

15*6 

4*4 

21-5 

41*1 

11*12 

4*14 

15*4 

4*1 

20 * 

-0 

10*53 


14*7 


20*3 

... 

10*53 


15*1 


17*6 



standard 

Deviations cTi = *38o 
2Xiar2=4*236 
4*236 

4 oO*iOr 2 


ar^ — *37 

2 ^ 13:3 = -3*^ 

. -3-207 

46 <riO*s 


cTg =*830 
-2*73 

.= ^2;73^_.^g 
46 <rgO*s 






Take the last group of figures, the death rate of males.
I take the average of the first five death rates, 20.1 in 1845
to 24.4 in 1849, namely, 22.0, and place in the
column at the middle of the period, namely, the year 1847.
I begin again at the second year, 1846, and take the average
for 1846-1850, namely, 22.6 again, and place that at 1848, the
middle year of that period; and so on for 46 successive
periods. Then on the Diagram XIII, I have represented the
line of moving averages by the dotted line running through
the continuous line. I think it is clear that that line offers a
solution of the problem. In taking the average of five
years we are equally likely to include the ups and downs of
their fluctuations. If there was a regular period, and the
fluctuations were five-yearly we should remove them
in five years, it would be the obvious time to take. If we
were dealing with figures referring to industry and the period
was ten years, ten years would be the most appropriate length
of time to average, including as it would one contribution from
each part of the fluctuation. If there is no regular period
there is no rule to be given as to what number of years we
shall take; it is a matter of convenience. If the five-yearly
average gives you a curve with sharp angles and apparently
random fluctuations, increase the number of years. It is
convenient to work with an odd number of years, for the
middle of the period then coincides with the middle of one of
the years; but, on the other hand, a period of ten years offers
arithmetical facilities. This method may, I think, be offered
for consideration; I believe it will be seen that it is a
solution of the problem. To complete it, I recommend
replacing the dotted line by a regular curve drawn very near
to it, smoothing out any little fluctuations which remain.
A curve thus drawn would fall from 1847 to
rise for about seven years and then fall, fairly rapidly, till
about 1882, and more slowly afterwards. In the nature of
things we cannot fix exact years for the end of the rise.
It is absolutely necessary to have some such method of
measuring the symptom before you can base any argument as
to the change in the quantity measured. That is very
important. For example, the curve D, which represents
imports, is a sharply fluctuating curve with a partial
If, to take a particular date, we had in 1879 looked at the

Diagram XIII. England.

A Deaths per 1,000 living, Males.
B Deaths per 1,000 living, Females.
C Persons married to 1,000 living.
D Value of Imports per head of the population of the United Kingdom.


previous two years only we should have thought there was a
rapid fall in the average imports; but if we looked at the
history of the phenomenon we should have seen that it
appeared to be only part of a minor fluctuation, and
we should have seen that average imports had been stationary
on the whole for eight years.
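
The formation of the line of moving averages, and of the deviations from it, may be sketched as follows. The ten yearly rates are invented, not taken from the table, and the function name `centred_moving_average` is introduced for illustration; each five-yearly average is placed against the middle year of its period.

```python
def centred_moving_average(values, window=5):
    """Average of each run of `window` successive years, placed at the
    middle year of the run (window is taken to be odd)."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]

# Hypothetical yearly death rates per 1,000 for ten successive years.
rates = [22.0, 23.5, 21.0, 24.0, 22.5, 21.5, 23.0, 20.5, 21.0, 22.0]

line = centred_moving_average(rates, 5)

# Deviation of each year's figure from the moving average placed against it;
# the first and last two years have no average and so no deviation.
deviations = [rates[i + 2] - line[i] for i in range(len(line))]

print([round(v, 2) for v in line])
print([round(d, 2) for d in deviations])
```

Ten years give six averages; fifty years, treated in the same way, give the 46 successive periods spoken of above.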

It is not possible to say at the moment whether a change is of
a permanent nature or simply one of those little fluctuations
which characterize the phenomenon throughout the
century. For instance, by 1908 we can perhaps judge of the
tendency in 1900, but cannot judge of the current tendency,
because we have not enough information.

The deviations obtained by subtracting the instantaneous
average from the figures for each year are given in
column. The deviations for the first three groups of figures
in the table are calculated on a similar method. These
deviations should have some affinity to the curve of error.
Great deviations should be rare compared with small deviations,
and the occurrence of small and great deviations should bear
some such relation as the occurrence of great and small
deviations in the curve of error; but the agreement is not
likely to be close, for the deviations calculated here are not
independent one of the other; they are bound together by the
fact that the same number is used in forming five successive
averages, while the curve of error assumes that the
absolutely independent.

Correlation between Series.

That is a very rapid discussion of a rather wide subject,
but I must lead on to the correlation between two
figures. If we were dealing with a curve with no symptom
and no period, for instance, two sets of figures relating to the
weather, $x_1, x_2 \ldots x_n$ representing the average temperature,
$y_1, y_2 \ldots y_n$ representing the average wind velocity, the
correlation between these two should be calculated as already
described. If we were dealing with a periodic curve, we
should replace the periodic curve by its line of averages
before comparing it with another curve. If the
irregular period, then I think we should proceed as if
a symptomatic curve with no period. Of course,
periodic curves with the same period are correlated,
sequences of events which are influenced by the



elianges in tlie weatlienrill gire a stroinr degree of correlation 
quite independently of anything else. That a quantity 
which ill general will not be worth iiieaMiringq lioi when we 
eoiiie ro very irregular periods such as those wliicli we fiiifl in 
trade statistics, it is wTjrth nieasiiriiig the correlation even 
through the periods; because it is not so obvious^ for instance, 
tliat all the fluctuations of exports are correlated with all the 
fluctuations of imports, and that the two together are 
correlated with the amount of employment. 

A difficulty arises in dealing with many curves from the 
fact that the successive deviations year by year are not 
altogether independent. Many curves which deal with 
sociological phenomena have fluctuations each of which 
extends over several years, so that a rise in one year is more 
often followed by a rise in the next than by a fall. Other 
curves have the opposite character, that an excess in one year is 
followed by defects in the other; for instance, if there is a 
great death rate in one year we may expect a comparatively 
small one in the following; and this absence of independence 
should be kept in mind when we have to base arguments on 
the resulting correlation. But apart from this, we could treat 
the deviations from the moving line of averages as deviations 
whose correlation we can fairly calculate. 

There is a very great difficulty in working out the correlation 
between symptomatic curves. If we do not take the 
deviation from the line of averages, but take the deviations 
from the average for the whole 50 years, any two symptomatic 
curves will show correlation. If we take two things which 
are absolutely disconnected, except that they are both 
phenomena arising in the progress of society, and work out 
the coefficient by the straightforward rule, we shall find there 
is some correlation. If two curves have short fluctuations 
which are correlated, but opposite symptoms, then owing to 
the symptom apart from the fluctuations there would be 
negative correlation, while owing to the fluctuations apart 
from the symptom there would be positive correlation; and 
when both are taken into account the correlation may be 
positive, zero, or negative. It is therefore necessary to treat 
the symptom separately from the short fluctuations. On the 
whole there is not much benefit in measuring the correlation 
coefficient for the symptoms; we should rather simply state that 
the symptom is say 15° upward in one case and 10° downward 
in the other. The useful measurement of the correlation 
between two such curves is not that of the symptoms, 
but that of the deviations. 
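
The case of opposite symptoms with correlated fluctuations can be tried on invented figures. The sketch below is only an illustration: the two series, the alternating fluctuation, the function names, and the choice of a five-year moving average are all assumptions made for the example, not figures from the table. One series rises steadily, the other falls, and both carry the same short fluctuation.

```python
from math import sqrt

def deviations_from_mean(values):
    """Deviations from the average of the whole period."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def deviations_from_moving_average(values, window=5):
    """Deviations from a centred moving average (window assumed odd)."""
    half = window // 2
    return [
        values[i] - sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]

def correlation_of_deviations(dx, dy):
    """Sum of products divided by the root of the product of the sums of squares."""
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Invented data: the same short fluctuation on an upward and a downward symptom.
years = range(20)
fluctuation = [1.0 if t % 2 == 0 else -1.0 for t in years]
rising = [t + f for t, f in zip(years, fluctuation)]
falling = [-t + f for t, f in zip(years, fluctuation)]

# Measured from the average of the whole period, the symptom dominates.
overall = correlation_of_deviations(
    deviations_from_mean(rising), deviations_from_mean(falling))

# Measured from the moving average, only the fluctuations remain.
detrended = correlation_of_deviations(
    deviations_from_moving_average(rising),
    deviations_from_moving_average(falling))

print(round(overall, 2), round(detrended, 2))
```

With these figures the coefficient taken from the whole-period average is strongly negative, while the coefficient taken from the moving average is practically +1, which is why the symptom and the fluctuations have to be treated separately.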

Another question which arises very often in a p 
way is, whether we should compare the deviation 
whole say imports, with the deviation for th 
say the marriage rate, in the same year, or in the ne 
Can we correlate the imports of 1847 with the marria 
of 1847, or should it be taken in comparison with 
1848? That question will often occur, especially 
marriage and birth rates. Mr. Hooker has sugges 
suggestion which has been made independently in A 
that we should work out the coefficients of correla 
the hypothesis of synchronism, and on alternate hy 
that one event follows half or one year after the ot 
see which correlation is the greatest. In this way we 
get a series of correlation coefficients according 
dates we take. 
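
A minimal sketch of this trial of alternative dates follows. It assumes annual deviations already measured from their moving averages, so it handles whole-year lags only (the half-year case would need half-yearly figures), and the two short series and all function names are invented for the illustration.

```python
from math import sqrt

def r_from_deviations(xs, ys):
    """Coefficient of correlation of paired deviations."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))

def lagged_correlations(leader, follower, max_lag):
    """Pair leader[t] with follower[t + lag] for each lag from 0 to max_lag years."""
    results = {}
    for lag in range(max_lag + 1):
        xs = leader[:len(leader) - lag]   # drop the years with no later partner
        ys = follower[lag:]               # shift the following series back
        results[lag] = r_from_deviations(xs, ys)
    return results

# Hypothetical deviations: the second series repeats the first one year later.
imports = [0.3, -0.2, 0.5, -0.4, 0.1, -0.3, 0.4, -0.1]
marriages = [0.0] + imports[:-1]

by_lag = lagged_correlations(imports, marriages, max_lag=2)
best_lag = max(by_lag, key=by_lag.get)   # the hypothesis giving the greatest coefficient
print(best_lag, {lag: round(r, 2) for lag, r in by_lag.items()})
```

Here the hypothesis of synchronism gives a negative coefficient, while a lag of one year gives the greatest, as the invented data were built to show.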

Before we proceed to measure correlation by mathe 
formulae we should observe it purely graphically; j 
graphic representation of series will often suggi 
existence of correlation, which can then be measured 
mathematical formula. The curves A and B in Diagra 
are obviously closely correlated. In the curves B ai 
cannot decide from the figure whether there is correh 
not ; at any rate the evidence of correlation is not s( 
In the curves B and D, I do not think we could decic 
the figure as drawn, though we might perhaps from 
drawn in a different way, whether there was correh 
not. 

Let us proceed to discuss how to put two curves c 
as to get optical evidence as to whether they are co 
or not. Instead of measuring figures as in Diagrai 
measure as in Diagram XIV. Plot out the de 
calculated on p. 79 above and below a base line 
senting zero; but before doing so it is necessary to 
the relative scales of the two quantities so as to 
definite relation the one to the other. There is no 
way of comparing pounds sterling with one per thou 
the marriage rate. The way which naturally suggest 

* Journal of the Royal Statistical Society, Sept. 1901. See 
pp. 490-1. I think that Mr. Hooker was also the first to publish a c 
of correlation based on deviations from a moving average. 




and it is very useful for making the optical evidence of 
correlation vivid, is to represent the standard deviation for 
each group by unity on the vertical scale. The standard 
deviation for the death rate of males is .830; of females it 
is .803; so we represent the deviation .830 for males and 
.803 for females by the same vertical line. If we were 
doing the same thing for imports and male death rates, we 
should represent by the same vertical scale .387 of a 
sterling and .83 death rate. This method has been applied 
Diagram XIV, for the comparison of lines A and 
Diagram XIII. I have put in the death rate for 
represented by the zigzag line, but I found I could not 
the death rate for females in the same way and ma 
lines distinct, and therefore have drawn short horizontal 
lines to the death rate for females year by year 
and lines representing the two series are in nearly 
year close together. The optical evidence of correlation 
very great indeed. 
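
The change of scale described above amounts to dividing every deviation by the standard deviation of its own series before plotting. A short sketch, with invented deviations and assumed function names, might run as follows.

```python
from math import sqrt

def standard_deviation(deviations):
    """Root of the mean of the squared deviations."""
    return sqrt(sum(d * d for d in deviations) / len(deviations))

def in_sd_units(deviations):
    """Express each deviation as a multiple of the series' standard deviation."""
    sd = standard_deviation(deviations)
    return [d / sd for d in deviations]

# Hypothetical yearly deviations for two series measured in different units.
males = [1.2, -0.6, 0.3, -0.9]
females = [1.1, -0.7, 0.4, -0.8]

scaled_males = in_sd_units(males)
scaled_females = in_sd_units(females)
print([round(v, 2) for v in scaled_males])
print([round(v, 2) for v in scaled_females])
```

After this division each series has a standard deviation of unity, so that a deviation of .830 for males and one of .803 for females would both be drawn at the same height.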

The illustration taken is of two series where the 
relation is nearly perfect; in less perfect cases we c 
evidence by noticing whether the maxima and minima 
at the same dates in the two series. For example, if w 
the value of exports and percentage of unemployment 
should find perhaps that the maximum of the one came 
same time as the minimum of the other throughout, an 
would give strong optical evidence of negative correlation 
A method of testing whether there was correlation c 
which would naturally suggest itself to anyone who 
small knowledge of probability, would be to see how often 
positive deviation of the one agreed with a positive deviation 
of the other; how often like signs concurred and how 
unlike signs concurred. If we wrote down 50 + and - 
at random and another 50 alongside, the chances of getting 
various numbers of agreements are easily calculated, a 
in fact the successive terms in the expansion of $(\frac{1}{2} + \frac{1}{2})^{50}$ 
that is not a good method, for it does not take into account 
one of the most important considerations, whether a 
fluctuation of the one corresponds with a great fluctuation 
the other or not. We should get equal evidence of correlation 
by this method when we had a resemblance of this s 
two curves where great fluctuations corresponded with 
and small with small throughout, and when the correspondence 
was in sign only. Those things are obviously not 
same importance, and so the method of merely counting 
will not take us very far. 
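
The counting of signs, and its weakness, can be sketched in a few lines. The function names and the two small pairs of series are assumptions made for the illustration; the chances are the terms of the binomial expansion referred to above.

```python
from math import comb

def sign_agreements(xs, ys):
    """Count the pairs whose deviations have like signs (zero deviations are skipped)."""
    return sum(1 for x, y in zip(xs, ys) if x * y > 0)

def chance_of_agreements(k, n):
    """Chance of exactly k agreements among n pairs of signs written at random."""
    return comb(n, k) / 2 ** n

def chance_at_least(k, n):
    """Chance of k or more agreements among n random pairs."""
    return sum(chance_of_agreements(j, n) for j in range(k, n + 1))

# Invented deviations: great answers to great in the first pair,
# while in the second pair the agreement is in sign only.
strong_x, strong_y = [5.0, -4.0, 6.0, -5.0], [4.0, -5.0, 5.0, -6.0]
weak_x, weak_y = [5.0, -4.0, 6.0, -5.0], [0.1, -3.0, 0.2, -0.1]

print(sign_agreements(strong_x, strong_y), sign_agreements(weak_x, weak_y))
print(round(chance_of_agreements(25, 50), 4), round(chance_at_least(35, 50), 4))
```

Both pairs of series return the same count of agreements, although the resemblance between them is plainly of a different order, which is the objection raised in the text.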

We will, then, proceed by the method of calculating 
correlation described above. Referring now to the table 
p. 79, it should be remarked that the method of evaluating 
the value of imports changes at the year 1852; I h 
approximate before that year from the value of the exports, 
because the figures given by the Board of Trade before and 
after that date are calculated on different methods and 
not comparable. Otherwise the figures of total value of 
imports for home consumption to the United Kingdom are 
comparable. To my mind the imports are more significant 
than the exports; and also it seems to me absurd to add 
imports and exports; I do not think you can add them 
together any more than you can add bread to butter. I have 
taken the imports only, and, without criticising the figures in 
detail, I have divided by the gross population as given in the 
Statistical Abstract. Thus we get the amount per head 
given in the first column in the table. I have only intended 
to work to the second place of decimals. The death and 
marriage rates are taken from the Registrar-General's Report 
for 1895, which gives the figures for the previous 50 years. 
The standard deviations given at the bottom of the table are 
obtained by taking the square root of the sum of the squares 
of the 46 deviations given, divided by 46, as in the ordinary 
formula for standard deviation. The standard deviation is 
essentially an absolute quantity without sign. 
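
Written in symbols, the rule just stated for the standard deviation is the first expression below, where $d_1, \dots, d_{46}$ stand for the 46 tabulated deviations (the symbols are introduced here for illustration and are not those of the table). The second expression is the usual product-sum form of the coefficient of correlation; it is assumed, not confirmed by this passage, to be the method "described above", with $x$ and $y$ the paired deviations and $n$ their number.

```latex
\sigma = \sqrt{\frac{d_1^2 + d_2^2 + \cdots + d_{46}^2}{46}},
\qquad
r = \frac{\sum x y}{n \, \sigma_x \, \sigma_y}
```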

I have calculated the coefficients of correlation between 
groups 1 and 2 (imports and marriage rate), between groups 
1 and 3 (imports and death rates), between 2 and 3 (marriage 
and death rates), and between 3 and 4 (death rates for males 
and females). I have intended to choose cases where, a priori, 
we might expect small correlation, no correlation, and great 
correlation. A priori we should expect correlation in the 
positive sense between imports and the marriage rate; not 
that increased imports cause an increase of the marriage rate, 
but the causes which produce prosperity are likely to have 
effect in increasing both imports and the marriage rate, the 
complexus of causes which decide the two things have 
something in common. The coefficient is .65. The marriage 
rate and death rate have presumably very little in common. 
One certainly could not say to start with whether an 
increasing death rate would synchronize with an increasing 
or with a diminishing marriage rate. The correlation between 
the two is -.19. The correlation between the imports and 
the death rate is -.22. The correlation between the death 
rate for males and that for females is +.99; it is practically 1, 
but the number 1 can only be obtained if there is an 
absolute proportion all through the scale, which there is not in 
CRITERION OF THE CORRELATION COEFFICIENT. 

Now we are face to face with the question, What do 
these numerical values mean, and which of them are 
significant? It is clear that some such question arises; 
if we write down two series absolutely at random 
and work out their formulae the chances are very much 
against your obtaining zero, and there are heavy odds against 
obtaining a small number. Now the chance of obtaining a 
coefficient near zero increases with the number of terms. If 
we have two series, $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$, measured 
from their averages, and we select a group of $x$'s which are 
near to one another, the $y$'s which will be their factors in 
forming the sum of the products are equally likely to be 
positive or negative; if we had an infinite number of these 
deviations their sum will be nothing; and the sum would 
tend to zero if we increased the number of terms, the actual 
deviation from zero being in inverse proportion to the square 
root of $n$, the number of terms. Hence the number of terms 
taken has much to do with the significance of the resulting 
coefficient of correlation, and we should expect that the 
quantity $\frac{1}{\sqrt{n}}$ would enter into the measurement of the 
significance of the coefficient of correlation. It is a little 
difficult to state and explain the measurement of the criterion 
of the significance; but it is absolutely necessary to make the 
attempt. Of the coefficients just given, the first and fourth 
are found to be significant, and the second and third not, 
when tested by the theoretical criterion. 
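
The dependence on the number of terms can be checked by experiment. The sketch below is an illustration under stated assumptions: it writes down pairs of unconnected series by drawing from a normal curve (an assumption, since the text says only "at random"), works out the coefficient for each pair, and compares the spread of the results with $1/\sqrt{n}$. The seed, the number of trials, and the function names are arbitrary choices.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Coefficient of correlation, deviations being taken from each average."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spread_of_chance_coefficients(n, trials, rng):
    """Root-mean-square coefficient found among unconnected series of n terms."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += correlation(xs, ys) ** 2
    return sqrt(total / trials)

rng = random.Random(1903)  # fixed seed, so that a run can be repeated
for n in (25, 100, 400):
    spread = spread_of_chance_coefficients(n, 200, rng)
    # If the spread varies as 1/sqrt(n), this product stays near unity.
    print(n, round(spread, 3), round(spread * sqrt(n), 2))
```

The product in the last column should remain near unity as $n$ grows, in keeping with the argument that the chance coefficient shrinks in inverse proportion to the square root of the number of terms.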

Suppose we take two correlated groups, and that there is, 
as a matter of fact, a definite value for the coefficient of 
correlation; and then suppose we take 50 samples from each, 
that is to say, 50 pairs of events, we shall not naturally obtain 
exactly the coefficient of correlation that belongs to 
the whole groups. The chances are against obtaining 
exactly that result. Now, the deviations from the actual 
coefficient of correlation which are obtained by taking samples 
and finding the correlation have a curve of frequency 

$$y = \frac{1}{c\sqrt{\pi}}\, e^{-x^2/c^2},$$

where $y$ is the probability that the coefficient 
obtained differs by $x$ from the true coefficient, and $c$, the 
modulus, $= (1 - r^2)\sqrt{2/n}$, where $r$ is the result obtained from 
the sample group, which consists of $n$ pairs. The probable 
error in this curve of error is .67 of $\frac{1 - r^2}{\sqrt{n}}$. For example, in 
the coefficient between imports and the marriage rate, $n = 46$, 
the calculated coefficient of correlation is .65, and the probable 
error for its curve of frequency is 

$$.67 \times \frac{1 - (.65)^2}{\sqrt{46}} = .056.$$

That is to say, from the calculation itself it is as likely as 
not that the actual coefficient is between .65 + .056 and 
.65 - .056. The chance of the true coefficient being as much 
as the modulus, namely .115, distant from the calculated .65 
is shown by the table of the error function to be only 16 in 
100; the chance of its being so far from .65 as to be actually 
zero is infinitesimal, for in the curve of error the cases 
where the deviation is as much as six times the modulus are 
practically non-existent. So that we have overwhelming 
evidence, if our general principle of calculation is correct, of 
correlation between the first and the second columns, and the 
most probable value of that correlation is two-thirds. In 
other words, the standard deviation of imports being £.387, 
and the standard deviation of the marriage rate .37, 
the most probable deviation of the marriage rate is 
$+\frac{2}{3}$ of .37 = .24, when we find a deviation in imports of 
+£.387, and so on in proportion. This statement should be 
connected with the graphic measurement of correlation 
discussed on p. 70. In the second case of correlation, that 
between imports and the male death rate, where the 
coefficient is -.22, the probable error by the method just 
described is .09. That is to say, our calculation means that it 
is as likely as not, from our evidence, that the correlation 
between these two series is between -.13 and -.31; the 
chance that the real correlation is zero or positive is quite 
perceptible. The chance from the table of the error 
function that a negative deviation as great as .22 should 
occur when the probable error is .09 is about one in ten. If 
we took ten groups which had zero or slight positive 
correlation, in one of these groups you might expect to get 
such a result as -.22. Similarly, the chance that uncorrelated 
groups of pairs give the coefficient found 
between marriage and male death rate, namely, -.19, is one 
in six: that is to say, once in six groups which were not 
connected you would obtain that apparent correlation. The 
chance that you obtain the coefficient of correlation .99 from 
a random group is practically zero. That is to say, there is 
correlation between male and female death rates, and it is of 
a nature that you could, given the deviation of death 
rate of males in the year, write down with very fair certainty 
the average death rate of females. For example, given that the 
death rate of males was +.5 in excess of the moving average, 
then the most probable death rate of females would be 
$\frac{.80}{.83} \times .5$, or .48 in excess of the average, and it is unlikely 
that any rate differing at all far from this will occur. 

* See Pearson, in Royal Soc. Trans., A. 175, p. 265; and correction, in 
Royal Soc. Proceedings, Oct. 18th, 1897; also Yule, in Statistical Journal, 
1897, p. 847. 

We have thus found a way of measuring correlation, and 
of testing the significance of our measurement, between two 
groups and between two series. The method must be used 
with discretion. There is no time to discuss under what 
circumstances it is applicable, nor the further developments 
of the theory. 
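
The tests of significance worked above can be retraced with the error function in place of the printed table. This is a sketch only: the function names are assumptions, the modulus is taken as $(1 - r^2)\sqrt{2/n}$, the chance is reckoned for a deviation in either direction, and, as in the text, the modulus is computed from the observed coefficient.

```python
from math import erf, sqrt

def modulus(r, n):
    """Modulus of the curve of error for a coefficient r found from n pairs."""
    return (1 - r * r) * sqrt(2 / n)

def probable_error(r, n):
    """Probable error, taken as .67 of (1 - r^2) / sqrt(n)."""
    return 0.67 * (1 - r * r) / sqrt(n)

def chance_of_deviation(distance, c):
    """Chance of a deviation at least as great as `distance`, in either direction."""
    return 1 - erf(distance / c)

def most_probable_deviation(r, sd_x, sd_y, x):
    """Most probable deviation of the second quantity, given a deviation x of the first."""
    return r * sd_y / sd_x * x

n = 46
cases = [
    ("imports and marriage rate", 0.65),
    ("imports and male death rate", -0.22),
    ("marriage and male death rate", -0.19),
]
for label, r in cases:
    c = modulus(r, n)
    print(label, round(probable_error(r, n), 3),
          round(chance_of_deviation(abs(r), c), 3))

# Marriage rate expected with an excess of .387 in imports, and the
# female death rate expected with an excess of .5 in the male rate.
print(round(most_probable_deviation(0.65, 0.387, 0.37, 0.387), 2))
print(round(most_probable_deviation(0.99, 0.83, 0.803, 0.5), 2))
```

Worked without rounding the intermediate figures, the results come out close to, though not always identical with, those quoted in the text, which rest on a printed table and two-figure probable errors.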


CONCLUSION. 

In these lectures I have tried to indicate the common-sense 
treatment of curve drawing and averages on the one hand, and 
the more delicate and exact method of representing groups 
and series by quantities based upon algebraic work on the 
other. Directly we attempt to use the latter methods, the 
algebraic methods, we find that we are bound to make 
approximations that involve the use of the theory of 
probability and the theory of error, and I have therefore 
been compelled to deal with these theories. When I have been 
treating them I have not attempted to promulgate any 
original opinions, I have only tried to illustrate principles, 
which are already laid down, by new examples. But since 
the modern shape of the theories of probability and error is 
new, and involves some matters which are still controversial 
(so far as mathematical reasoning can be controversial), I 
have found it necessary to spend some little time in examining 
the foundations of the theories in some detail. I have only 
been able to deal with the beginnings of some of the difficult 
questions which arise, and I am sorry that for want of time I 
have been compelled to leave out many illustrations of the 
practical utility of the methods; I have had to spend time on 
the theory rather than on the practice. My object will have 
been completely attained if I have succeeded in indicating 
the scope and the interest of the application of the theory of 
error, a subject which urgently needs the co-operation of 
serious students, alike to calculate experimental data, which 
are very much wanting, and to criticize, establish, and enlarge 
the body of theory. 



